I have a dataframe like this:
id date company ......
123 2019-01-01 A
224 2019-01-01 B
345 2019-01-01 B
987 2019-01-03 C
334 2019-01-03 C
908 2019-01-04 C
765 2019-01-04 A
554 2019-01-05 A
482 2019-01-05 D
and I want to get the cumulative number of unique values over time for the 'company' column. So if a company appears at a later date they are not counted again.
My expected output is:
date cumulative_count
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
I've tried:
df.groupby(['date']).company.nunique().cumsum()
but this double counts if the same company appears on a different date.
Pandas Series: cumsum() function The cumsum() function is used to get cumulative sum over a DataFrame or Series axis. Returns a DataFrame or Series of the same size containing the cumulative sum. The index or the name of the axis. 0 is equivalent to None or 'index'.
We declare an empty list cum_list to which we will append elements to form the cumulative sum list. Initialize a sum variable sm=0. Start iterating over the input list, with each iteration we increment the sum value to previous value+ the current element. On each iteration, the sum value is appended to the cum_list.
sum() function returns the sum of the values for the requested axis. Parameters: axis : {index (0), columns (1)}
You can use the nunique() function to count the number of unique values in a pandas DataFrame.
Using duplicated
+ cumsum
+ last
m = df.duplicated('company')
d = df['date']
(~m).cumsum().groupby(d).last()
date
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
dtype: int32
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With