
Funnel analysis calculation, how would you calculate a funnel?

Suppose that I track an 'event' a user takes on a website, events can be things like:

  1. viewed homepage
  2. added item to cart
  3. checkout
  4. paid for order

Now each of those events is stored in a database like:

session_id event_name created_date ..

So now I want to build a report to display a particular funnel that I will define like:

Step#1   event_n
Step#2   event_n2
Step#3   event_n3

So this particular funnel has 3 steps, and each step can be associated with ANY event I choose.

How can I build a report for this now given the above data I have?

Note: just want to be clear, I want to be able to create any funnel that I define, and be able to create a report for it.

The most basic way I can think of is:

  1. get all events for each step I have in my database
  2. step#1 will be, x% of people performed event_n
  3. Now I will have to query for the sessions from step#1 that ALSO performed the step#2 event, and display the %
  4. Same as #3, but for step#3 with the condition on step#2
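The step-by-step filtering described above can be sketched in plain Python, assuming the events are available as `(session_id, event_name, timestamp)` tuples (all names and sample rows below are illustrative, not from a real schema):

```python
from datetime import datetime

# Illustrative event rows: (session_id, event_name, created_date)
events = [
    ("s1", "viewed homepage",    datetime(2012, 5, 1, 10, 0)),
    ("s1", "added item to cart", datetime(2012, 5, 1, 10, 5)),
    ("s1", "checkout",           datetime(2012, 5, 1, 10, 9)),
    ("s2", "viewed homepage",    datetime(2012, 5, 1, 11, 0)),
    ("s2", "added item to cart", datetime(2012, 5, 1, 11, 2)),
    ("s3", "viewed homepage",    datetime(2012, 5, 1, 12, 0)),
]

funnel = ["viewed homepage", "added item to cart", "checkout"]

# For each step, keep only sessions that passed the previous step AND
# performed this step's event afterwards; record the survivor count.
survivors = None
counts = []
for step_event in funnel:
    # Earliest occurrence of this step's event per session.
    first_seen = {}
    for sid, name, ts in events:
        if name == step_event and (sid not in first_seen or ts < first_seen[sid]):
            first_seen[sid] = ts
    if survivors is None:
        survivors = first_seen            # step 1: everyone who did the event
    else:
        survivors = {sid: ts for sid, ts in first_seen.items()
                     if sid in survivors and ts > survivors[sid]}
    counts.append(len(survivors))

print(counts)  # [3, 2, 1] for the sample data above
```

Each count divided by the first count gives the percentage per step; the repeated filtering is exactly the chain of conditions in steps #3 and #4 above.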

I'm curious how these online services can display these types of reports in a hosted SaaS environment. Does map-reduce make this easier somehow?

asked May 12 '12 by Blankman

2 Answers

First, the answer, using standard SQL and given your hypothesis: there is a table EVENTS with a simple layout:

EVENTS
-----------------------------
SESSION_ID , EVENT_NAME , TMST

To get the sessions that performed step#1 at some time:

-- QUERY 1
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID;

Here I make the assumption that event1 can happen more than once per session. The result is a list of unique sessions that performed event1 at some time.

In order to get step2 and step3, I can just do the same:

-- QUERY 2
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID;
-- QUERY 3
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID;

Now, you want to select sessions that performed step1, step2 and step3 - in that order. More precisely, you need to count sessions that performed step1, then count sessions that went on to step2, then count sessions that went on to step3. Basically we just need to combine the 3 above queries with LEFT JOINs to list the sessions that entered the funnel and which steps they performed:

-- FUNNEL FOR S1/S2/S3
SELECT 
  Q1.SESSION_ID, 
  Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
  Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
  Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM
  -- QUERY 1
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1
LEFT JOIN
  -- QUERY 2
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2
  -- Q1 & Q2
  ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST<Q2.TMST
LEFT JOIN
  -- QUERY 3
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID) AS Q3
  -- Q2 & Q3
  ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST<Q3.TMST

The result is a list of unique sessions that entered the funnel at step1, and may have continued to step2 and step3... e.g:

SESSION_ID_1,TRUE,TRUE,TRUE
SESSION_ID_2,TRUE,TRUE,FALSE
SESSION_ID_3,TRUE,FALSE,FALSE
...

Now we just have to compute some stats, for example:

SELECT
  STEP1_COUNT,
  STEP1_COUNT-STEP2_COUNT AS EXIT_AFTER_STEP1,
  STEP2_COUNT*100.0/STEP1_COUNT AS PERCENTAGE_TO_STEP2,
  STEP2_COUNT-STEP3_COUNT AS EXIT_AFTER_STEP2,
  STEP3_COUNT*100.0/STEP2_COUNT AS PERCENTAGE_TO_STEP3,
  STEP3_COUNT*100.0/STEP1_COUNT AS COMPLETION_RATE
FROM
(-- QUERY TO COUNT sessions at each step
  SELECT
    SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
    SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
    SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
  FROM
    [... insert the funnel query here ...]
) AS COMPUTE_STEPS

Et voilà !
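Assembled end to end, the whole thing can be tried out with SQLite from Python (the table and column names follow the answer; the sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EVENTS (SESSION_ID TEXT, EVENT_NAME TEXT, TMST INTEGER);
INSERT INTO EVENTS VALUES
  ('s1','event1',1),('s1','event2',2),('s1','event3',3),
  ('s2','event1',1),('s2','event2',2),
  ('s3','event1',1);
""")

# The funnel query: per-step "first occurrence" subqueries, chained
# with LEFT JOINs that enforce step ordering via the timestamps.
funnel_sql = """
SELECT
  Q1.SESSION_ID,
  Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
  Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
  Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
    WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1
LEFT JOIN
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
    WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2
  ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST<Q2.TMST
LEFT JOIN
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
    WHERE EVENT_NAME='event3' GROUP BY SESSION_ID) AS Q3
  ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST<Q3.TMST
"""

# The stats query, with the funnel query spliced in.
stats_sql = f"""
SELECT STEP1_COUNT, STEP2_COUNT, STEP3_COUNT FROM (
  SELECT
    SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
    SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
    SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
  FROM ({funnel_sql})
) AS COMPUTE_STEPS
"""

step1, step2, step3 = conn.execute(stats_sql).fetchone()
print(step1, step2, step3)  # 3 2 1
```

With the sample rows, 3 sessions enter at step1, 2 continue to step2, and 1 completes step3, so the completion rate would be 1/3.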

Now for the discussion. First point: the result is pretty straightforward once you take the "set" (or functional) way of thinking rather than the "procedural" approach. Don't visualize the database as a collection of fixed tables with columns and rows... that is how it is implemented, but it is not the way you interact with it. It's all sets, and you can arrange the sets any way you need!

Second point: that query will be automatically optimized to run in parallel if you are using an MPP database, for instance. You don't even need to program the query differently, use map-reduce or whatever... I ran the same query on my test dataset with more than 100 million events and got results in seconds.

Last but not least, the query opens endless possibilities. Just group the results by referrer, keywords, landing page, or user information, and analyse which provides the best conversion rate, for instance!

answered Sep 30 '22 by SergeFantino

The core problem in the way you are thinking about this is that you are thinking in a SQL/table type model. Each event is one record. One of the nice things about NoSQL technologies (which you feel an inkling towards) is that you can naturally store the record as one session per record. Once you store the data in a session-based manner, you can write a routine that checks to see if that session complies with the pattern or not. No need to do joins or anything, just a loop over a list of transactions in a session. Such is the power of semi-structured data.

What if you store your sessions together? Then, all you have to do is iterate through each session and see if it matches.

This is a fantastic use case for HBase, in my opinion.

With HBase, you store the session ID as the row key, then each of the events as values with the time stamp as the column qualifier. What this leaves you with is data that is grouped together by session ID, then sorted by time.

Ok, so now you want to figure out what % of sessions enacted behavior 1, then 2, then 3. You run a MapReduce job over this data. The MapReduce job will provide you one session per row key/value pair. Write a loop over the data to check whether it matches the pattern. If it does, add 1 to the count; if not, don't.
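The per-session check inside that loop is just an ordered subsequence test. A sketch in Python (the function name and sample events are made up for illustration):

```python
def matches_funnel(session_events, funnel):
    """session_events: list of (timestamp, event_name) for ONE session.

    Returns True if the funnel events occur in this order within the
    session (other events may be interleaved), i.e. the funnel is a
    subsequence of the session's time-ordered event names."""
    ordered = [name for _, name in sorted(session_events)]
    step = 0
    for name in ordered:
        if step < len(funnel) and name == funnel[step]:
            step += 1
    return step == len(funnel)

session = [(1, "viewed homepage"), (2, "searched"),
           (3, "added item to cart"), (4, "checkout")]

hit  = matches_funnel(session, ["viewed homepage", "added item to cart", "checkout"])
miss = matches_funnel(session, ["checkout", "viewed homepage"])
print(hit, miss)  # True False
```

Note this runs in one pass over a single session's events, with no joins, which is the point being made about semi-structured, session-grouped storage.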


Without going all out with HBase, you can use MapReduce to sessionize your unorganized data at rest. Group by the session ID, then in the reducer you'll have all of the events associated with that session grouped together. Now, you're basically where you were with HBase where you can write a method in the reducer that checks for the pattern.
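The sessionize-then-check shape of that reducer can be sketched without Hadoop at all: sort by session ID (the "shuffle" phase), then group, then run the pattern check per group. All names and sample rows here are illustrative:

```python
from itertools import groupby
from operator import itemgetter

# Flat, unorganized event log: (session_id, timestamp, event_name)
events = [
    ("s2", 1, "viewed homepage"),
    ("s1", 1, "viewed homepage"),
    ("s1", 3, "checkout"),
    ("s1", 2, "added item to cart"),
    ("s2", 2, "added item to cart"),
]

funnel = ["viewed homepage", "added item to cart", "checkout"]

# "Shuffle": sort by session_id then timestamp, so each group below
# plays the role of one reducer call holding a whole session, in order.
events.sort()

matched = total = 0
for sid, session in groupby(events, key=itemgetter(0)):
    names = [name for _, _, name in session]
    total += 1
    # Subsequence check: do the funnel steps occur in order?
    step = 0
    for name in names:
        if step < len(funnel) and name == funnel[step]:
            step += 1
    if step == len(funnel):
        matched += 1

print(matched, total)  # 1 2
```

Here session s1 completes the full funnel and s2 stops after the cart, so 1 of 2 sessions matches; in a real MapReduce job the sort/group would be done by the framework's shuffle, not in memory.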


HBase might be overkill if you don't have a ridiculous amount of data. Any sort of database that can store data hierarchically will be good in this situation. MongoDB, Cassandra, Redis all come to mind and have their strengths and weaknesses.

answered Sep 30 '22 by Donald Miner