Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate nunique() for groupby in pandas

I have a dataframe with columns:

  1. diff - difference between registration date and payment date,in days
  2. country - country of user
  3. user_id
  4. campaign_id -- another categorical column, we will use it in groupby

I need to calculate count distinct users for every country+campaign_id group who has diff<=n. For example, for country 'A', campaign 'abc' and diff 7 i need to get count distinct users from country 'A', campaign 'abc' and diff <= 7

My current solution(below) works too long

import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
        'country':np.random.choice(['A', 'B', 'C', 'D'], 10000),
        'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
        'diff':np.random.choice(range(10), 10000),
        'user_id': np.random.choice(range(1000), 10000)
        })
## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff']<=diff,:]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

Maybe there is better way to do this?

like image 348
Slavka Avatar asked Mar 15 '18 10:03

Slavka


People also ask

How to count unique values in a pandas groupby object?

How to count unique values in a Pandas Groupby object? - GeeksforGeeks How to count unique values in a Pandas Groupby object? Groupby as the name suggests groups attributes on the basis of similarity in some value. We can count the unique values in pandas Groupby object using groupby (), agg (), and reset_index () method.

What is the use of groupby in pandas?

groupby () – groupby () function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. Syntax: DataFrame.groupby (by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

What is dataframegroupby nunique?

DataFrameGroupBy.nunique(dropna=True)[source]¶ Return DataFrame with counts of unique elements in each position. Parameters dropnabool, default True Don’t include NaN in the counts. Returns nunique: DataFrame Examples

How do I select a group in a pandas group?

Selecting a groups In order to select a group, we can select group using GroupBy.get_group (). We can select a group by applying a function GroupBy.get_group this function select a single group. import pandas as pd


2 Answers

First use list comprehension with concat and assign for join all together and then groupby with nunique with adding column diff, last rename columns and if necessary add reindex for custom columns order:

df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
like image 154
jezrael Avatar answered Sep 24 '22 12:09

jezrael


One alternative below, but @jezrael's solution is optimal.

Performance benchmarking

%timeit original(df)  # 149ms
%timeit jp(df)        # 81ms
%timeit jez(df)       # 47ms

def original(df):
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

    return result_df

def jp(df):

    result_df = pd.DataFrame()
    lst = []
    lst_append = lst.append
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        lst_append(tmp_df)

    result_df = result_df.append(pd.concat(lst, ignore_index=True, axis=0), ignore_index=True)

    return result_df

def jez(df):
    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
    return df2
like image 29
jpp Avatar answered Sep 20 '22 12:09

jpp