
Cumulative count of unique values in pandas

Tags:

pandas

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'week': [1, 1, 2, 1, 2, 2],
                   'module_id': ['A', 'B', 'A', 'A', 'B', 'C']})
+---+---------+------+-----------+
|   | user_id | week | module_id |
+---+---------+------+-----------+
| 0 |       1 |    1 |         A |
| 1 |       1 |    1 |         B |
| 2 |       1 |    2 |         A |
| 3 |       2 |    1 |         A |
| 4 |       2 |    2 |         B |
| 5 |       2 |    2 |         C |
+---+---------+------+-----------+

What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:

+---+---------+------+-------------------------+
|   | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 |       1 |    1 |                       2 |
| 1 |       1 |    2 |                       2 |
| 2 |       2 |    1 |                       1 |
| 3 |       2 |    2 |                       3 |
+---+---------+------+-------------------------+

It is straightforward to do this with a loop; for example, this works:

running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}

But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.

There's a similar-sounding question here, but looking at the accepted answer (here), the original poster does not want uniqueness cumulatively across dates, as I do.

How would I do this vectorised in pandas?

Asked Jul 16 '19 by dumbledad


1 Answer

The idea is to create a list per group using both columns, then use np.cumsum to build cumulative lists, and finally convert the values to sets and take their length:

import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cumulative_module_count'))

print(df1)
   user_id  week  cumulative_module_count
0        1     1                        2
1        1     2                        2
2        2     1                        1
3        2     2                        3
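For reference, here is another vectorised sketch (my own variant, not part of the accepted answer) that avoids building intermediate lists: mark the first occurrence of each (user_id, module_id) pair with duplicated, sum the new modules per week, and take a per-user running total. This assumes rows are sorted by week within each user, as in the example data.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'week': [1, 1, 2, 1, 2, 2],
                   'module_id': ['A', 'B', 'A', 'A', 'B', 'C']})

# True the first time a (user_id, module_id) pair appears;
# relies on rows being sorted by week within each user.
is_new = ~df.duplicated(['user_id', 'module_id'])

df1 = (df.assign(is_new=is_new)
         .groupby(['user_id', 'week'])['is_new'].sum()  # new modules seen that week
         .groupby(level=0).cumsum()                     # running total per user
         .reset_index(name='cumulative_module_count'))

print(df1)
```

Because it never materialises per-group lists, this form should scale better on a large frame than the list-concatenation approach.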
Answered Oct 03 '22 by jezrael