Cohort analysis in SQL

Tags: sql, mysql
Looking to do some cohort analysis on a userbase. We have two tables, "users" and "sessions", and both have a "created_at" field. I'm looking to formulate a query that yields a 7 by 7 table of numbers (with some blanks) that shows, for each creation day, a count of the users created on that day who also have a session created y days ago (y = 0..6), indicating that they returned on that day.

created_at  d2  d3  d4
today       *   *   *
today-1     49  *   *
today-2     45  30  *
today-3     47  48  18
...

In this case, 47 users who were created on today-3 returned on today-2.

Can I perform this in a single MySQL query? I can perform the queries individually like so, but it'd be really nice to have it all in one query.

SELECT `users`.*
FROM `users`
INNER JOIN `sessions` ON `sessions`.`user_id` = `users`.`id`
WHERE `users`.`os` = 'ios'
  AND (`sessions`.`updated_at` BETWEEN '2013-01-16 08:00:00' AND '2013-01-17 08:00:00')
Newy asked Jan 22 '13 06:01


2 Answers

This seems a complex problem. Whether or not it also seems difficult to you, it is never a bad idea to build the solution up from a smaller problem.

You could start, for instance, with a query returning all the users (just the users) that have registered within the last week, i.e. starting from the day six days ago, as per your requirement:

SELECT *
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY

The next step could be grouping the results by dates and counting rows in every group:

SELECT
  created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY created_at

If created_at is a datetime or timestamp, use DATE(created_at) as the grouping criterion:

SELECT
  DATE(created_at) AS created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY DATE(created_at)

However, you don't seem to want absolute dates in the output, but only relative ones, like today, today - 1 day etc. In that case, you could use the DATEDIFF() function, which returns the number of days between two dates, to produce (numeric) offsets from today and group by those values:

SELECT
  DATEDIFF(CURDATE(), created_at) AS created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY DATEDIFF(CURDATE(), created_at)
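
To make the offset grouping concrete, here is a minimal sketch that can be run without a MySQL server. It is an illustration under stated assumptions, not the query itself: SQLite (via Python's standard sqlite3 module) stands in for MySQL, a fixed TODAY value stands in for CURDATE(), julianday() arithmetic stands in for DATEDIFF(), and the sample rows are made up.

```python
import sqlite3

TODAY = "2013-01-22"  # assumed fixed reference date, stands in for CURDATE()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO users (id, created_at) VALUES (?, ?)",
    [
        (1, "2013-01-22 09:15:00"),
        (2, "2013-01-21 23:59:59"),
        (3, "2013-01-21 00:00:00"),
        (4, "2013-01-19 12:00:00"),
        (5, "2013-01-10 12:00:00"),  # older than six days, filtered out
    ],
)

# julianday(today) - julianday(date(created_at)) is SQLite's rough
# equivalent of MySQL's DATEDIFF(CURDATE(), created_at).
offset_counts = conn.execute(
    """
    SELECT
      CAST(julianday(:today) - julianday(date(created_at)) AS INTEGER) AS day_offset,
      COUNT(*) AS user_count
    FROM users
    WHERE date(created_at) >= date(:today, '-6 days')
    GROUP BY day_offset
    ORDER BY day_offset
    """,
    {"today": TODAY},
).fetchall()

print(offset_counts)  # [(0, 1), (1, 2), (3, 1)]
```

Users 2 and 3 were created at different times on the same day, so they land in the same offset group.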

Your created_at column would contain "dates" like 0, 1 and so on till 6. Converting them into today, today-1 etc. is trivial and you will see that in the final query. So far, however, we've reached the point at which we need to take one step back (or, perhaps, it's rather a half step to the right), because we don't really need to count the users but rather their returns. So, the actual working dataset from users that is needed at the moment will be this:

SELECT
  id,
  DATEDIFF(CURDATE(), created_at) AS day_offset
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY

We need user IDs to join this rowset to (the one that will be derived from) sessions and we need day_offset as the grouping criterion.

Moving on, a similar transformation will need to be performed on the sessions table, and I won't go into details on that. Suffice it to say that the resulting query will be nearly identical to the last one, with just two exceptions:

  • id gets replaced with user_id;

  • DISTINCT is applied to the entire subset.

The reason for DISTINCT is to return no more than one row per user & day: it is my understanding that however many sessions a user might have on a particular day, you want to count them as one return. So, here's what gets derived from sessions:

SELECT DISTINCT
  user_id,
  DATEDIFF(CURDATE(), created_at) AS day_offset
FROM sessions
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
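
The effect of DISTINCT can be checked on a handful of rows. This is only a sketch under assumptions: SQLite (Python's sqlite3 module) replaces MySQL, a fixed TODAY replaces CURDATE(), julianday() arithmetic replaces DATEDIFF(), and the session rows are invented for the example.

```python
import sqlite3

TODAY = "2013-01-22"  # assumed fixed reference date, stands in for CURDATE()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (user_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO sessions (user_id, created_at) VALUES (?, ?)",
    [
        (1, "2013-01-22 08:00:00"),  # three sessions on the same day
        (1, "2013-01-22 13:30:00"),
        (1, "2013-01-22 21:45:00"),
        (1, "2013-01-21 10:00:00"),
        (2, "2013-01-21 11:00:00"),
    ],
)

# The same query is run twice, once without and once with DISTINCT.
query = """
    SELECT {distinct}
      user_id,
      CAST(julianday(?) - julianday(date(created_at)) AS INTEGER) AS day_offset
    FROM sessions
    WHERE date(created_at) >= date(?, '-6 days')
    ORDER BY user_id, day_offset
"""
raw_rows = conn.execute(query.format(distinct=""), (TODAY, TODAY)).fetchall()
returns = conn.execute(query.format(distinct="DISTINCT"), (TODAY, TODAY)).fetchall()

print(len(raw_rows))  # 5 rows, one per session
print(returns)        # [(1, 0), (1, 1), (2, 1)], one per user and day
```

Without DISTINCT, user 1 would be counted three times as a return for today.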

Now it only remains to join the two derived tables, apply grouping and use conditional aggregation to get the required results:

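SELECT
  -- label each cohort row as today, today-1, ..., today-6
  CONCAT('today', IFNULL(CONCAT('-', NULLIF(u.day_offset, 0)), '')) AS created_at,
  -- each SUM counts the cohort's users who had a session on today-N
  SUM(s.day_offset = 0) AS d0,
  SUM(s.day_offset = 1) AS d1,
  SUM(s.day_offset = 2) AS d2,
  SUM(s.day_offset = 3) AS d3,
  SUM(s.day_offset = 4) AS d4,
  SUM(s.day_offset = 5) AS d5,
  SUM(s.day_offset = 6) AS d6
FROM (
  -- users created within the last seven days, with their offset from today
  SELECT
    id,
    DATEDIFF(CURDATE(), created_at) AS day_offset
  FROM users
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) u
LEFT JOIN (
  -- at most one row per user and day, however many sessions there were
  SELECT DISTINCT
    user_id,
    DATEDIFF(CURDATE(), created_at) AS day_offset
  FROM sessions
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) s
ON u.id = s.user_id
GROUP BY u.day_offset
-- GROUP BY alone does not guarantee row order, so sort explicitly
ORDER BY u.day_offset
;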

I must admit that I haven't tested or debugged this, but, if needed, I'll be happy to work with sample data once you have provided it. :)
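
In the meantime, the shape of the final query can be sanity-checked with a small script. The sketch below is a port, not the MySQL query itself, and rests on several assumptions: SQLite (Python's sqlite3 module) replaces MySQL, a fixed TODAY replaces CURDATE(), julianday() arithmetic replaces DATEDIFF(), the label is built with || and CASE instead of CONCAT/IFNULL/NULLIF, an ORDER BY is included so the output is deterministic, and all sample rows are invented.

```python
import sqlite3

TODAY = "2013-01-22"  # assumed fixed reference date, stands in for CURDATE()

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE sessions (user_id INTEGER, created_at TEXT);
    INSERT INTO users VALUES
      (1, '2013-01-22 07:00:00'),
      (2, '2013-01-21 09:00:00'),
      (3, '2013-01-21 18:00:00'),
      (4, '2013-01-20 12:00:00'),
      (5, '2013-01-19 12:00:00');  -- this user never opens a session
    INSERT INTO sessions VALUES
      (1, '2013-01-22 07:05:00'), (1, '2013-01-22 20:00:00'),
      (2, '2013-01-21 09:10:00'), (2, '2013-01-22 10:00:00'),
      (3, '2013-01-21 18:30:00'),
      (4, '2013-01-20 12:30:00'), (4, '2013-01-22 08:00:00');
    """
)

# SQLite stand-in for DATEDIFF(CURDATE(), created_at).
OFFSET = "CAST(julianday(:today) - julianday(date(created_at)) AS INTEGER)"

# One conditional-aggregation column per session offset, d0 .. d6.
day_columns = ",\n  ".join(f"SUM(s.day_offset = {n}) AS d{n}" for n in range(7))

query = f"""
SELECT
  'today' || CASE WHEN u.day_offset = 0 THEN '' ELSE '-' || u.day_offset END
    AS created_at,
  {day_columns}
FROM (
  SELECT id, {OFFSET} AS day_offset
  FROM users
  WHERE date(created_at) >= date(:today, '-6 days')
) u
LEFT JOIN (
  SELECT DISTINCT user_id, {OFFSET} AS day_offset
  FROM sessions
  WHERE date(created_at) >= date(:today, '-6 days')
) s ON u.id = s.user_id
GROUP BY u.day_offset
ORDER BY u.day_offset
"""

cohort_rows = conn.execute(query, {"today": TODAY}).fetchall()
for row in cohort_rows:
    print(row)
```

With this data, the today-1 row reports d0 = 1 and d1 = 2: both users created yesterday had a session that day, and one of them came back today. The today-3 row consists of NULLs (None in Python) because its only user has no sessions, and SUM over nothing but NULLs is NULL.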

Andriy M answered Oct 12 '22 05:10


Example of a month-wise cohort:

First, let's build a table of each individual user's activity, month by month:

SELECT 
    mu.created_timestamp AS cohort
    , mu.id AS user_id
    -- `order` is a reserved word in MySQL, so the table name must be quoted with backticks
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 1 AND l.user_id = mu.id) AS m1
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 2 AND l.user_id = mu.id) AS m2
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 3 AND l.user_id = mu.id) AS m3
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 4 AND l.user_id = mu.id) AS m4
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 5 AND l.user_id = mu.id) AS m5
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 6 AND l.user_id = mu.id) AS m6
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 7 AND l.user_id = mu.id) AS m7
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 8 AND l.user_id = mu.id) AS m8
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 9 AND l.user_id = mu.id) AS m9
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 10 AND l.user_id = mu.id) AS m10
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 11 AND l.user_id = mu.id) AS m11
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 12 AND l.user_id = mu.id) AS m12
FROM user mu 
WHERE mu.created_timestamp BETWEEN '2018-01-01 00:00:00' AND '2019-12-31 23:59:59'

Then, from this table, sum the individual users' activity for each cohort:

SELECT MONTH(c.cohort) AS cohort
       ,COUNT(c.user_id) AS signups
       ,SUM(c.m1) AS m1 
       ,SUM(c.m2) AS m2 
       ,SUM(c.m3) AS m3 
       ,SUM(c.m4) AS m4 
       ,SUM(c.m5) AS m5 
       ,SUM(c.m6) AS m6 
       ,SUM(c.m7) AS m7 
       ,SUM(c.m8) AS m8 
       ,SUM(c.m9) AS m9 
       ,SUM(c.m10) AS m10 
       ,SUM(c.m11) AS m11 
       ,SUM(c.m12) AS m12 
FROM (
  SELECT
    mu.created_timestamp AS cohort
    , mu.id AS user_id
    -- `order` is a reserved word in MySQL, so the table name must be quoted with backticks
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 1 AND l.user_id = mu.id) AS m1
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 2 AND l.user_id = mu.id) AS m2
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 3 AND l.user_id = mu.id) AS m3
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 4 AND l.user_id = mu.id) AS m4
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 5 AND l.user_id = mu.id) AS m5
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 6 AND l.user_id = mu.id) AS m6
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 7 AND l.user_id = mu.id) AS m7
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 8 AND l.user_id = mu.id) AS m8
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 9 AND l.user_id = mu.id) AS m9
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 10 AND l.user_id = mu.id) AS m10
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 11 AND l.user_id = mu.id) AS m11
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 12 AND l.user_id = mu.id) AS m12
  FROM user mu
  WHERE mu.created_timestamp BETWEEN '2018-01-01 00:00:00' AND '2019-12-31 23:59:59'
) AS c
GROUP BY MONTH(c.cohort)

You can use days instead of months, although cohort analysis is mostly done by month.
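
As a possible alternative to the twelve correlated subqueries, the same month-wise counts can be sketched with a single LEFT JOIN and conditional aggregation. This is an illustration under assumptions, not the query above: SQLite (Python's sqlite3 module) replaces MySQL, strftime('%m', ...) replaces MONTH(), the sample rows are invented, and all of them fall within one year. Note that a month number ignores the year, so with a date range spanning several years, January 2018 and January 2019 would be merged into one cohort.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (id INTEGER PRIMARY KEY, created_timestamp TEXT);
    CREATE TABLE "order" (user_id INTEGER, order_date TEXT);
    INSERT INTO user VALUES
      (1, '2018-01-05 10:00:00'),
      (2, '2018-01-20 10:00:00'),
      (3, '2018-02-10 10:00:00');
    INSERT INTO "order" VALUES
      (1, '2018-01-06'), (1, '2018-01-25'), (1, '2018-02-03'),
      (2, '2018-03-01'),
      (3, '2018-02-11');
    """
)

# One column per calendar month: the number of distinct users in the
# cohort who placed at least one order in that month.
month_columns = ",\n  ".join(
    "COUNT(DISTINCT CASE WHEN CAST(strftime('%m', o.order_date) AS INTEGER) = "
    f"{m} THEN u.id END) AS m{m}"
    for m in range(1, 13)
)

query = f"""
SELECT
  CAST(strftime('%m', u.created_timestamp) AS INTEGER) AS cohort,
  COUNT(DISTINCT u.id) AS signups,
  {month_columns}
FROM user u
LEFT JOIN "order" o ON o.user_id = u.id
GROUP BY cohort
ORDER BY cohort
"""

monthly_cohorts = conn.execute(query).fetchall()
for row in monthly_cohorts:
    print(row[:5])  # cohort, signups, m1, m2, m3
```

For the January cohort this prints (1, 2, 1, 1, 1): two signups, of whom one ordered in January, one in February and one in March. COUNT(DISTINCT ...) keeps a user with several orders in the same month from being counted more than once.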

Shayan Shaikh answered Oct 12 '22 04:10
