I have a dataframe with the following columns:
- diff - difference between registration date and payment date, in days
- country - country of the user
- user_id
- campaign_id - another categorical column; we will use it in the groupby

I need to calculate the count of distinct users for every country + campaign_id group who have diff <= n.
For example, for country 'A', campaign 'abc' and diff 7, I need to get the count of distinct users from country 'A', campaign 'abc' with diff <= 7.
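In other words, each cell of the result is just a filtered distinct count. A minimal sketch for the example cell above (hypothetical values 'A', 'abc' and 7; column names taken from the test dataframe defined below):

n = 7
mask = (df['country'] == 'A') & (df['campaign'] == 'abc') & (df['diff'] <= n)
cell_value = df.loc[mask, 'user_id'].nunique()  # distinct users meeting all three conditions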
My current solution (below) takes too long:
import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
    'diff': np.random.choice(range(10), 10000),
    'user_id': np.random.choice(range(1000), 10000)
})

## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff'] <= diff, :]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns = ['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df], ignore_index=True, axis=0)
Maybe there is a better way to do this?
First use a list comprehension with concat and assign to join everything together, then groupby with nunique while adding the diff column; last, rename the columns and, if necessary, add reindex for a custom column order:
df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
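One thing to note about this approach: df1 materialises one copy of every row per threshold it qualifies for, so the intermediate frame grows with the number of distinct diff values. A quick way to see the blow-up on the test data (a sketch, assuming df and df1 from the code above):

print(len(df), len(df1))
# with diff uniform over 0..9, df1 has roughly 5.5 * len(df) rows (about 55,000 here)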
One alternative is below, but @jezrael's solution is optimal.
Performance benchmarking
%timeit original(df) # 149ms
%timeit jp(df) # 81ms
%timeit jez(df) # 47ms
def original(df):
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
    return result_df
def jp(df):
    lst = []
    lst_append = lst.append
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        lst_append(tmp_df)
    result_df = pd.concat(lst, ignore_index=True, axis=0)
    return result_df
def jez(df):
    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
    return df2
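As a quick sanity check (not part of the original benchmark), the three functions should agree once their rows are put into a common order; a minimal sketch, assuming the test dataframe df from the question:

cols = ['country', 'campaign', 'diff']
a = original(df).sort_values(cols).reset_index(drop=True)
b = jp(df).sort_values(cols).reset_index(drop=True)
c = jez(df).sort_values(cols).reset_index(drop=True)
print(a.equals(b) and a.equals(c))  # expected: True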