Graduate from Mixpanel: Funnel Analysis with SQL and R

This post is part one of a two part series. See part two here.

What is Funnel Analysis?

In a nutshell, funnel analysis allows you to follow a user through a series of self-defined events as well as, allowing you to calculate the given conversion rates between event to event. There are multiple ways and opinions on how to define a funnel analysis.The way I’ve developed here takes the time a series of events occurred and the order of their occurrence into consideration.

Just like the real-world tool, a funnel moves objects from one area to another through a bottleneck. An example of a web-based funnel would be a user hitting your web store’s homepage, then placing an item in the basket, and finally checking out.

I prefer a funnel where it is impossible, or very difficult, to complete later stages of the funnel before going through the earlier stages. It’s a more conservative approach, which is useful if you want to improve user’s movement through a defined pathway. The analysis below assumes this type of funnel structure. It’s also useful for funnels where each step can be done out of order, but a conversion will only count if the steps were done in order, with the risk that this could lead to underestimating.

Prepping your data

For ease of using the queries I’ve defined below, I assume a simple data structure of the following schema:

 TimeUserEvent
DescriptionThe timestamp of when the event occurredThe unique user IDThe event the user performed
ExampleJanuary 01, 2016 12:00:000 GSTabc123Login
Luckily, I already have a table called events that logs every event within our website, the user, and the time, along with other information. Depending on how you have your tables set up, you might have to do some prep to get it into the above structure.

The analysis below will use a simple funnel of three events:

  • Event 1 : Land on homepage
  • Event 2 : Place item in cart
  • Event 3 : Purchase item

Calculating the Funnel

Before showing the final query, I want to quickly walk through each part of the query for ease of understanding my methodology.

Defining the Scope

I wanted to create a query that makes it easy to both define the time frame allowed from funnel start to completion, and to easily switch out the funnel events. Whether you want to look at a week of data, or the last 90 days, I wanted to easily switch out the dates. Similarly, switching out events in a funnel with the same amount of steps should be as trivial as changing filters.

Adding or subtracting events to the funnel take a bit more work, but only require careful deleting or pasting similar lines of logic.

I’ll define query_1 as below:

My goal with the above query is to find the first occurrence of when a user performed each of the events, if they performed them at all.

Notice in query_1, I have defined my timeframe as the month of December using the Treasure Data provided UDF td_time_range. I search for all relevant events at once, as well as making sure not to capture any events that may not have a defined user.

Transposing the Data

Currently, we have a long table, meaning each user has multiple entries, one for each event that the user performed. Next, we will transform that table from one that is long to one that is wide.

We’ll go from data like this:

user_ideventfirst_occurance (in long format)
123homepage13413223
123put_item_in_cart13415841
123checkout13416013
456homepage13984713
456put_item_in_cart13984890
To like this:
user_idhomepageput_item_in_cartcheckout
12313413223

13415841
13416013
45613984713
13984890NULL
This format will allow us to compare events within the context of each user and if they made it through the pipeline in the proper order.

query_2 is defined as below:

Calculating the Funnel

In addition to calculating the counts of each user as they move through the funnel, I’ve also included total counts of all users who have performed that event outside of our defined funnel rules. This will allow insight into whether or not the funnel is appropriately designed.

A well designed funnel should show similar counts between the funnel count and the total count. A large discrepancy between the two may indicate that users are using your site differently than you have designed it.

We define query_3 as below:

Earlier, we defined that the first occurrence of the first event in our funnel as the start of the clock. Luckily, homepage_total_count does double duty as it is both the count of start of the funnel as well as the count of all users who performed the event.

Finally, putting together query_1, query_2, and query_3 we are able to obtain one query that calculates the funnel from start to finish.

Which returns the following results:

homepageput_item_in_cartcheckoutconversion_1_2conversion_2_3
7141661659715

Visualizing the Data

Now that we have our final counts, let’s move to R to help visualize the data. We’ll have to some minor transformations before visualizing the data using the ggplot2 package. For code simplicity’s sake, I’ve removed any options that change the aesthetics of the graph and ask you to refer to the official ggplot2 documentation if you wish to change the color and format of the graph.

Let df be the dataframe of the data that we produced above. Using the melt() from the reshape2 package, I can reformat the data to make it easier to use with ggplot2:

Then to create the graph:

Which produces the image below. (Remember, I’ve stripped out any formatting code).

Next Steps

While this analysis is useful for looking at how your funnel is performing within a defined time period, I prefer analyses that gives context to the current funnel performance. In Using SQL and R to do Funnel Analysis – Part Two, I’ll show how to calculate conversation events and compare your funnel analysis by month over a six month period.

Diana Shealy
Diana Shealy
Diana Shealy is a Data Scientist at Treasure Data. After graduating from college in 2012, she moved to the Bay Area to begin her career. Outside of work, she enjoys traveling, music, and cooking.
In Case You Missed It