I would like to cumulatively count unique values from a column in a pandas DataFrame by week. For example, imagine that I have data like this:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
|   | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1       | 1    | A         |
| 1 | 1       | 1    | B         |
| 2 | 1       | 2    | A         |
| 3 | 2       | 1    | A         |
| 4 | 2       | 2    | B         |
| 5 | 2       | 2    | C         |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
|   | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1       | 1    | 2                       |
| 1 | 1       | 2    | 2                       |
| 2 | 2       | 1    | 1                       |
| 3 | 2       | 2    | 3                       |
+---+---------+------+-------------------------+
It is straightforward to do this as a loop, for example this works:
running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar-sounding question here, but looking at the accepted answer (here), the original poster does not want uniqueness across dates cumulatively, as I do.
How would I do this vectorised in pandas?
The idea is to create lists per group (by both columns), then use np.cumsum to build cumulative lists per user, and finally convert the values to sets and take their length:
import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)                   # one list of modules per (user, week)
         .groupby(level=0)
         .apply(np.cumsum)              # concatenate the lists cumulatively per user
         .apply(lambda x: len(set(x)))  # count distinct modules seen so far
         .reset_index(name='cumulative_module_count'))
print(df1)
   user_id  week  cumulative_module_count
0        1     1                        2
1        1     2                        2
2        2     1                        1
3        2     2                        3
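For completeness, here is an alternative vectorised sketch that avoids Python-level list concatenation: flag the first occurrence of each (user_id, module_id) pair, take a running sum of those flags per user, then keep the largest value within each (user_id, week) group. This assumes the frame is sorted by week within each user; the name `out` is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'week': [1, 1, 2, 1, 2, 2],
                   'module_id': ['A', 'B', 'A', 'A', 'B', 'C']})

# 1 the first time a user sees a module, 0 on repeats
is_new = (~df.duplicated(['user_id', 'module_id'])).astype(int)

out = (df.assign(cum=is_new.groupby(df['user_id']).cumsum())
         .groupby(['user_id', 'week'], as_index=False)['cum'].max()
         .rename(columns={'cum': 'cumulative_module_count'}))
print(out)
```

Because `duplicated` and `cumsum` are implemented in compiled code, this should scale better on a large frame than building and re-hashing sets row by row.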