
Cumulative count of unique values in pandas

Tags:

pandas

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'week': [1, 1, 2, 1, 2, 2],
                   'module_id': ['A', 'B', 'A', 'A', 'B', 'C']})
+---+---------+------+-----------+
|   | user_id | week | module_id |
+---+---------+------+-----------+
| 0 |       1 |    1 |         A |
| 1 |       1 |    1 |         B |
| 2 |       1 |    2 |         A |
| 3 |       2 |    1 |         A |
| 4 |       2 |    2 |         B |
| 5 |       2 |    2 |         C |
+---+---------+------+-----------+

What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:

+---+---------+------+-------------------------+
|   | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 |       1 |    1 |                       2 |
| 1 |       1 |    2 |                       2 |
| 2 |       2 |    1 |                       1 |
| 3 |       2 |    2 |                       3 |
+---+---------+------+-------------------------+

It is straightforward to do this with a loop; for example, this works:

running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}

But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.

There's a similar-sounding question here, but looking at the accepted answer (here), the original poster does not want uniqueness cumulatively across dates, as I do.

How would I do this vectorised in pandas?

Asked Jul 16 '19 by dumbledad


1 Answer

The idea is to create a list per group using both columns, then use np.cumsum to build cumulative lists, and finally convert the values to sets and take their length:

import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cumulative_module_count'))

print(df1)
   user_id  week  cumulative_module_count
0        1     1                        2
1        1     2                        2
2        2     1                        1
3        2     2                        3
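For reference, here is another vectorised sketch (my own variant, not part of the accepted answer) that avoids building intermediate lists: mark the first occurrence of each (user_id, module_id) pair with duplicated, sum the new modules per week, and take a per-user running total. This assumes rows are sorted by week within each user, as in the example data.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'week': [1, 1, 2, 1, 2, 2],
                   'module_id': ['A', 'B', 'A', 'A', 'B', 'C']})

# True the first time a (user_id, module_id) pair appears;
# relies on rows being sorted by week within each user.
is_new = ~df.duplicated(['user_id', 'module_id'])

df1 = (df.assign(is_new=is_new)
         .groupby(['user_id', 'week'])['is_new'].sum()  # new modules seen that week
         .groupby(level=0).cumsum()                     # running total per user
         .reset_index(name='cumulative_module_count'))

print(df1)
```

Because it never materialises per-group lists, this form should scale better on a large frame than the list-concatenation approach.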
Answered Oct 03 '22 by jezrael