Rolling Retention Done Right in SQL

SaaS’s Most Important Metric: Retention

If you are a product manager or in a growth role at a SaaS company, rolling retention (also called cohort analysis) is a key metric for monitoring your growth. It’s so important that every analytics service worth its salt (Google Analytics, Mixpanel, Adobe Analytics, Heap Analytics, and Amplitude, to name a few) has it baked into its UI. Here’s an example from Heap Analytics.

As an organization weans itself off of analytics tools and tries to build its own analytics with SQL and databases, it naturally tries to calculate rolling retention using SQL. It soon realizes that writing SQL to produce a nice tabular view like the one above isn’t quite so straightforward.

But here’s the good news: with a bit of advanced SQL, you can calculate rolling retention against any cohort window, with any behavioral condition, assuming that you collect the necessary customer events. This is the ultimate reason to use your own database + SQL vs. canned analytics tools: the infinite flexibility in defining, tweaking, and improving your retention analysis to drive growth.

Rolling Retention: Definition

Before we dive into SQL, here’s a quick review of rolling retention, also known as cohort analysis.

Rolling retention is defined as the percentage of returning users, measured at a regular interval (typically weekly or monthly) and grouped by their sign-up week/month, also known as a cohort. By grouping users based on when they signed up, you can gain insight into how your product/marketing/sales initiatives have impacted retention. For example, suppose you had a major launch and got many sign-ups over the following few days. How well did these new users stick around compared to, say, pre-launch users who signed up a week prior? Or, say you sent out discount offers to dormant users from 2 months ago: how many of them came back and stayed on the product? Rolling retention helps you answer these questions at a glance.

Setting the Stage

In the rest of this article, we assume that you have a table called “logins” with just two columns: user_id and login_time. Here’s an example:

user_id     login_time
a005baae    1468571605
a005baae    1468571605
a03224fb    1468571605
a03224fb    1468571605
a03224fb    1468571605
a05e88ec    1468598437
a05e88ec    1468615063
a05e9452    1468569072

Note that we are using Unix (epoch) time for the time field, which is the number of seconds since Jan 1, 1970 UTC (for example, 1468571605 corresponds to July 15, 2016 UTC). The decision to use Unix time is a conscious one: because Unix time is an integer with a clear meaning, we can focus on the core logic and avoid reasoning about time zones and other tangential issues.

Also, I will be working with weekly cohorts. Depending on your product and internal processes, you might want to adjust this to daily or monthly.

Finally, in the rest of this article, I am using Treasure Data’s flavor of Presto SQL. That said, the core logic should work on any SQL engine.

Step 1: Bucketing Visits By Week

The first step is to bucket visits into cohorts, i.e., to determine which users logged into the app at all in a given week. To do this, we group users by week as follows:
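
(A minimal sketch; TD_DATE_TRUNC is Treasure Data’s UDF for truncating a Unix timestamp, here down to the start of its week.)

SELECT
  user_id,
  TD_DATE_TRUNC('week', login_time) AS login_week  -- Unix time for the start of that week
FROM logins
GROUP BY 1, 2  -- ordinals refer to user_id and login_week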

The query “squashes” all logins in each week into one row per user, using the TD_DATE_TRUNC function. Also, note that Presto SQL lets you refer to output columns in the GROUP BY clause by their ordinals.

Step 2: Normalizing Visits

The next step is to calculate the number of weeks between the week of the first visit and the week of each given visit. Said another way, given the following table:

user_id     login_week
a005baae    1462147200
a005baae    1462752000
a03224fb    1462147200
a03224fb    1462752000
a03224fb    1463356800
a05e88ec    1462752000
a05e88ec    1463356800
a05e9452    1462752000

you create the “first_week” and “week_number” columns:

user_id     login_week    first_week    week_number
a005baae    1462147200    1462147200    0
a005baae    1462752000    1462147200    1
a03224fb    1462147200    1462147200    0
a03224fb    1462752000    1462147200    1
a03224fb    1463356800    1462147200    2
a05e88ec    1462752000    1462752000    0
a05e88ec    1463356800    1462752000    1
a05e9452    1462752000    1462752000    0

So, how do you do this?

The answer is to use the FIRST_VALUE window function.
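
Here is a minimal sketch, reusing the Step 1 query as a WITH clause:

WITH by_week AS (
  SELECT
    user_id,
    TD_DATE_TRUNC('week', login_time) AS login_week
  FROM logins
  GROUP BY 1, 2
)
SELECT
  user_id,
  login_week,
  FIRST_VALUE(login_week) OVER (
    PARTITION BY user_id   -- one window per user
    ORDER BY login_week    -- earliest login week comes first
  ) AS first_week
FROM by_week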

First, note that we “saved” the previous query as the temporary table by_week using a WITH clause.

Second, note how FIRST_VALUE is used. The mechanics of window functions are a whole topic unto themselves, but in this case, FIRST_VALUE partitions the rows by user_id, orders each partition by login_week in ascending order, and takes the first value, i.e., each user’s earliest login week. Hence, the result of the query is:

user_id     login_week    first_week
a005baae    1462147200    1462147200
a005baae    1462752000    1462147200
a03224fb    1462147200    1462147200
a03224fb    1462752000    1462147200
a03224fb    1463356800    1462147200
a05e88ec    1462752000    1462752000
a05e88ec    1463356800    1462752000
a05e9452    1462752000    1462752000

From here, it’s easy to calculate week_number by subtracting first_week from login_week and dividing by 24*60*60*7 (recall that Unix time is in seconds, and there are 604,800 seconds in a week). Hence, the final query is:
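
(A sketch; the intermediate name firsts is just for illustration.)

WITH by_week AS (
  SELECT
    user_id,
    TD_DATE_TRUNC('week', login_time) AS login_week
  FROM logins
  GROUP BY 1, 2
),
firsts AS (
  SELECT
    user_id,
    login_week,
    FIRST_VALUE(login_week) OVER (
      PARTITION BY user_id ORDER BY login_week
    ) AS first_week
  FROM by_week
)
SELECT
  user_id,
  login_week,
  first_week,
  (login_week - first_week) / (24 * 60 * 60 * 7) AS week_number  -- 604,800 seconds per week
FROM firsts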

And you are done calculating week_number. Phew!

Step 3: Tallying Up

We are almost done. Now we need to create the “pivot table” view of the retention analysis. In general, most SQL engines don’t support full pivot-table functionality with dynamic columns. In our case, however, we can mimic it with the SUM function and CASE expressions, because we only need to compute the number of visitors for weeks 0 through 9 per cohort.

Here is the actual query:
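
(A sketch; the intermediate names firsts and weekly are just for illustration, and TD_TIME_FORMAT renders the Unix timestamp as a readable date.)

WITH by_week AS (
  SELECT user_id, TD_DATE_TRUNC('week', login_time) AS login_week
  FROM logins
  GROUP BY 1, 2
),
firsts AS (
  SELECT user_id, login_week,
    FIRST_VALUE(login_week) OVER (PARTITION BY user_id ORDER BY login_week) AS first_week
  FROM by_week
),
weekly AS (
  SELECT user_id, first_week,
    (login_week - first_week) / (24 * 60 * 60 * 7) AS week_number
  FROM firsts
)
SELECT
  TD_TIME_FORMAT(first_week, 'yyyy-MM-dd') AS first_week,      -- cohort week, as a date
  SUM(CASE WHEN week_number = 0 THEN 1 ELSE 0 END) AS week_0,  -- cohort size
  SUM(CASE WHEN week_number = 1 THEN 1 ELSE 0 END) AS week_1,
  SUM(CASE WHEN week_number = 2 THEN 1 ELSE 0 END) AS week_2,
  -- ...same pattern for week_3 through week_8...
  SUM(CASE WHEN week_number = 9 THEN 1 ELSE 0 END) AS week_9
FROM weekly
GROUP BY 1
ORDER BY 1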

Which yields:

first_week    week_0    week_1    week_2    ...
5/1/16        2         2         1         ...
5/8/16        2         1         0         ...

This gives you the raw user counts for each cohort, week over week. If you want to convert the columns into percentages, divide week_1 through week_9 by week_0.
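
For example, a minimal sketch (assuming the Step 3 result is saved as a WITH clause named retention, which is illustrative):

SELECT
  first_week,
  week_0,
  ROUND(100.0 * week_1 / week_0, 1) AS week_1_pct,  -- share of the cohort retained in week 1
  ROUND(100.0 * week_2 / week_0, 1) AS week_2_pct   -- and in week 2, and so on
FROM retention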

Why Do This?

If you aren’t a SQL veteran, you might have found the above queries a bit daunting. Why go through so much trouble when these cohort reports are already available in analytics tools?

Because SQL is more expressive, customizable and powerful.

Here are some known limitations that I’ve personally run into with canned analytics tools for retention analysis:

  1. Adjusting conversion windows: Many tools have a fixed or limited set of conversion windows for funnel analysis, such as “up to 180 days” or “30, 60 or 90 days”. However, the sales cycle varies from one SaaS business to another, and it’s helpful to have the flexibility to adjust the conversion window.
  2. Defining custom conversion events: To be sure, analytics services do let you customize events. That said, these custom events often require explicit definitions and/or cannot be modified post facto. By collecting raw events and analyzing them in SQL, you gain full flexibility in defining and evolving your conversion events.
  3. Combining with Salesforce data to filter “bad” conversions: Not all sign-ups are created equal; some are tire-kickers, while others can turn into multi-million dollar accounts. Unfortunately, this kind of account- and lead-level information is not captured in analytics tools, but rather in CRM tools like Salesforce or customer support tools like Zendesk. Thanks to Treasure Data’s Salesforce and Zendesk data connectors, you can bring customer metadata into your retention analysis, which lets you calculate deeper metrics such as revenue retention, or retention by geography or buyer persona.

If you are curious to learn how Treasure Data can take your customer analytics to the next level, read how our customers are using our service or request a product demo below.
