Generating a retention cohort from a pandas dataframe

I have a pandas dataframe that looks like this:

+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1      | 2015-01-25       |             0 | NaT        |
| ACC2      | 2015-01-11       |             0 | NaT        |
| ACC3      | 2015-01-18       |             0 | NaT        |
| ACC4      | 2014-12-21       |            14 | 2015-02-12 |
| ACC5      | 2014-12-21       |             5 | 2015-02-15 |
| ACC6      | 2014-12-21       |             0 | 2015-02-22 |
+-----------+------------------+---------------+------------+

It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.

Each registration week is a cohort. To know how many people are part of the cohort I can use:

visit_log.groupby('RegistrationWeek').AccountID.nunique()

What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.

Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.

The end product would look something like this:

+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1             | 70%         | 30%         | 20%         |
| week2             | 70%         | 30%         |             |
| week3             | 40%         |             |             |
+-------------------+-------------+-------------+-------------+

I tried pivoting the dataframe like this:

visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')

But I haven't nailed down the values part. I'll need to somehow count the unique account IDs and divide that count by the cohort sizes from the groupby above.

I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!

Thanks

grzlybear asked Feb 26 '15 14:02


2 Answers

There are several aspects to your question.

What you can build with the data you have

There are several kinds of retention. For simplicity, we’ll mention only two:

  • Day-N retention: if a user registered on day 0, did she log in on day N? (Logging in on day N+1 does not affect this metric.) To measure it, you need to keep track of every login of your users.
  • Rolling retention: if a user registered on day 0, did she log in on day N or any day after that? (Logging in on day N+1 affects this metric.) To measure it, you just need the last known login of each user.

If I understand your table correctly, you have two relevant variables to build your cohort table: the registration date and the last log (visit week). The number of weekly visits seems irrelevant.

With only the last log per account, you can only go with option 2, rolling retention.
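
To make the difference between the two definitions concrete, here is a minimal sketch on a made-up visit log, using weeks instead of days. The `visits` frame, its column names, and both helper functions are hypothetical and only serve as an illustration; they are not taken from the question's data.

```python
import pandas as pd

# Hypothetical visit log: one row per visit, "week" = whole weeks since signup.
visits = pd.DataFrame({
    "account": ["A", "A", "B", "B", "C"],
    "week":    [0,   2,   0,   1,   0],
})

def week_n_retention(visits, n):
    # Share of accounts with a visit exactly n weeks after signup.
    seen_in_week_n = visits.loc[visits["week"] == n, "account"].nunique()
    return seen_in_week_n / visits["account"].nunique()

def rolling_retention(visits, n):
    # Share of accounts whose last known visit is n weeks after signup or later.
    last_week = visits.groupby("account")["week"].max()
    return (last_week >= n).mean()

print(week_n_retention(visits, 1))   # only B visited in week 1
print(rolling_retention(visits, 1))  # A (last seen in week 2) and B both count
```

Note that `rolling_retention` only uses the last visit per account, which is why it can be computed from a table that stores a single last log.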

How to build the table

First, let's build a dummy data set so that we have enough to work on and you can reproduce it:

import pandas as pd
import numpy as np
import math
import datetime as dt

np.random.seed(0) # so that we all have the same results

def random_date(start, end,p=None):
    # Return a date randomly chosen between two dates
    if p is None:
        p = np.random.random()
    return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))

n_samples = 1000 # How many users do we want ?
index = range(1,n_samples+1)

# A range of signup dates, say, one year.
end = dt.datetime.today()
from dateutil.relativedelta import relativedelta 
start = end - relativedelta(years=1)

# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
                     index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x : random_date(start, end,x))
# last logs randomly distributed within 10 weeks of signing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x : random_date(x, x + relativedelta(weeks=10)))

So now we should have something that looks like this:

users.head()

Here is some code to build a cohort table:

### Some useful functions
def add_weeks(sourcedate,weeks):
    return sourcedate + dt.timedelta(days=7*weeks)

def first_day_of_week(sourcedate):
    return sourcedate - dt.timedelta(days = sourcedate.weekday())

def last_day_of_week(sourcedate):
    return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))  

def retained_in_interval(users,signup_week,n_weeks,end_date):
    '''
        For a given list of users, returns the number of users 
        that signed up in the week of signup_week (the cohort)
        and that are retained after n_weeks
        end_date is just here to control that we do not un-necessarily fill the bottom right of the table
    '''
    # Define the span of the given week
    cohort_start       = first_day_of_week(signup_week)
    cohort_end         = last_day_of_week(signup_week)
    if n_weeks == 0:
        # If this is our first week, we just take the number of users that signed up on the given period of time
        return len( users[(users['signup_date'] >= cohort_start) 
                        & (users['signup_date'] <= cohort_end)])
    elif pd.to_datetime(add_weeks(cohort_end,n_weeks)) > pd.to_datetime(end_date):
        # If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
        # we return an easily recognizable value (not 0, as it would cause confusion)
        return float("Inf")
    else:
        # Otherwise, we count the number of users that signed up in the given period of time,
        # and whose last known log is at least n_weeks after their signup (rolling retention)
        signed_up_in_cohort = (users['signup_date'] >= cohort_start) & (users['signup_date'] <= cohort_end)
        # Compare the two datetime series; the outer parentheses are needed because & binds tighter than >=
        still_active = (pd.to_datetime(users['last_log'])
                        >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x,n_weeks))))
        return len(users[signed_up_in_cohort & still_active])

With this we can create the actual function:

def cohort_table(users,cohort_number=6,period_number=6,cohort_span='W',end_date=None):
    '''
        For a given dataframe of users, return a cohort table with the following parameters:
        cohort_number: the number of rows of the table
        period_number: the number of weekly periods to follow each cohort (columns 'Week 0' to 'Week period_number')
        cohort_span: the pandas frequency of the cohort dates; the helper functions above assume weekly cohorts ('W')
        end_date: the date after which we stop counting the users
    '''
    # By default, the table ends today:
    if end_date is None:
        end_date = dt.datetime.today()
    # The index of the dataframe is a range of cohort dates covering the cohort_number weeks before end_date
    dates = pd.date_range(add_weeks(end_date,-cohort_number), periods=cohort_number, freq=cohort_span)

    cohort = pd.DataFrame(columns=['Sign up'])
    cohort['Sign up'] = dates
    # We will compute the number of retained users, column by column
    #      (There probably is a more Pythonic way of doing it)
    range_dates = range(0,period_number+1)
    for p in range_dates:
        # Name of the column
        s_p = 'Week '+str(p)
        cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users,row['Sign up'],p,end_date), axis=1)

    cohort = cohort.set_index('Sign up')        
    # absolute values to percentage by dividing by the value of week 0 :
    cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'),axis='index')
    return cohort

Now you can call it and see the result:

cohort_table(users)
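
As a possible alternative to the column-by-column loop, the same rolling-retention idea can be sketched with a single groupby. This is only a sketch under assumptions: the function name `rolling_cohort_table`, the `demo_users` frame, and the choice to mark not-yet-observable cells with NaN (instead of infinity) are all mine, and cohorts are keyed by the Monday of the signup week rather than by the dates produced by `pd.date_range`.

```python
import numpy as np
import pandas as pd

def rolling_cohort_table(users, n_periods=6, end_date=None):
    # Hypothetical helper: share of each weekly cohort whose last log is
    # at least p whole weeks after signup, for p = 0..n_periods.
    if end_date is None:
        end_date = pd.Timestamp.today()
    signup = pd.to_datetime(users["signup_date"])
    last = pd.to_datetime(users["last_log"])
    # Cohort key: Monday (at midnight) of the signup week.
    cohort = (signup - pd.to_timedelta(signup.dt.weekday, unit="D")).dt.normalize()
    # Whole weeks between signup and last known log.
    weeks_alive = (last - signup).dt.days // 7
    table = pd.DataFrame(
        {f"Week {p}": (weeks_alive >= p).groupby(cohort).mean()
         for p in range(n_periods + 1)}
    )
    # Blank out cells whose cohort week, shifted by p weeks, ends after end_date.
    for p in range(1, n_periods + 1):
        too_recent = table.index + pd.Timedelta(days=7 * p + 6) > end_date
        table.loc[too_recent, f"Week {p}"] = np.nan
    return table

# Tiny made-up data set: two users in the week of 2024-01-01, one in the next week.
demo_users = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-08"]),
    "last_log":    pd.to_datetime(["2024-01-20", "2024-01-05", "2024-01-16"]),
})
table = rolling_cohort_table(demo_users, n_periods=2, end_date=pd.Timestamp("2024-01-21"))
print(table)
```

Because the week is counted from each user's own signup date, the numbers can differ slightly from a table that measures weeks from the start of the cohort.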

Hope it helps

rom_j answered Oct 17 '22 22:10


Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.

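# Label every date with its ISO year and week
# (%G is the ISO year that matches the ISO week number %V).
# On pandas >= 2.1, DataFrame.map replaces the deprecated applymap.
users = users.applymap(lambda d: d.strftime('%G-%V') if pd.notnull(d) else d)
tab = pd.crosstab(users['signup_date'], users['last_log'])
totals = tab.T.sum()
# Cumulative churn per cohort, subtracted from the cohort size
retention_counts = ((tab.T.cumsum().T * -1)
                    .replace(0, np.nan)  # weeks before the cohort's first churn
                    .add(totals, axis=0)
                   )
retention = retention_counts.div(totals, axis=0)

# Shift every row left so that column k is the k-th week with data for that cohort
realigned = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realigned, index=retention.index)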

rump roast answered Oct 17 '22 22:10