I have a data frame like this:
0 04:10 obj1
1 04:10 obj1
2 04:11 obj1
3 04:12 obj2
4 04:12 obj2
5 04:12 obj1
6 04:13 obj2
Wanted to get a cumulative count for all the objects like this:
idx time object obj1_count obj2_count
0 04:10 obj1 1 0
1 04:10 obj1 2 0
2 04:11 obj1 3 0
3 04:12 obj2 3 1
4 04:12 obj2 3 2
5 04:12 obj1 4 2
6 04:13 obj2 4 3
Tried playing with cumsum but not sure that is the right way. Any suggestions?
There is a dedicated function for exactly this kind of operation: cumcount
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
>>> df
A
0 a
1 a
2 a
3 b
4 b
5 a
>>> df.groupby('A').cumcount()
0 0
1 1
2 2
3 0
4 1
5 3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0 3
1 2
2 1
3 1
4 0
5 0
dtype: int64
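Applied to the question's data (reconstructed here as a small frame, since the original construction isn't shown), cumcount gives a single running count per object; it starts at 0 within each group, so adding 1 makes it 1-based:

```python
import pandas as pd

# Reconstruction of the question's data.
df = pd.DataFrame({
    'time': ['04:10', '04:10', '04:11', '04:12', '04:12', '04:12', '04:13'],
    'object': ['obj1', 'obj1', 'obj1', 'obj2', 'obj2', 'obj1', 'obj2'],
})

# cumcount() numbers each row within its group starting at 0;
# +1 turns it into a running 1-based count per object.
df['running_count'] = df.groupby('object').cumcount() + 1
```

Note this produces one combined column rather than the separate obj1_count/obj2_count columns in the desired output; the answers below show how to get per-object columns.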
You can just compare the column against the value of interest and call cumsum:
In [12]:
df['obj1_count'] = (df['object'] == 'obj1').cumsum()
df['obj2_count'] = (df['object'] == 'obj2').cumsum()
df
Out[12]:
time object obj1_count obj2_count
idx
0 04:10 obj1 1 0
1 04:10 obj1 2 0
2 04:11 obj1 3 0
3 04:12 obj2 3 1
4 04:12 obj2 3 2
5 04:12 obj1 4 2
6 04:13 obj2 4 3
Here the comparison will produce a boolean series:
In [13]:
df['object'] == 'obj1'
Out[13]:
idx
0 True
1 True
2 True
3 False
4 False
5 True
6 False
Name: object, dtype: bool
When you call cumsum on the above, the True values are converted to 1 and the False values to 0, and these are summed cumulatively.
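With more than two objects, the same comparison trick can be wrapped in a loop over the unique values rather than writing one line per object. A minimal sketch, with column names following the question's obj1_count/obj2_count convention:

```python
import pandas as pd

# Reconstruction of the question's object column.
df = pd.DataFrame({
    'object': ['obj1', 'obj1', 'obj1', 'obj2', 'obj2', 'obj1', 'obj2'],
})

# One cumulative-count column per distinct value, built from
# the same boolean comparison as above.
for name in df['object'].unique():
    df[f'{name}_count'] = (df['object'] == name).cumsum()
```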
You can generalize this process by taking the cumsum of pd.get_dummies. This works for an arbitrary number of objects you want to count, without needing to specify each one individually:
# Get the cumulative counts.
counts = pd.get_dummies(df['object']).cumsum()
# Rename the count columns as appropriate.
counts = counts.rename(columns=lambda col: col+'_count')
# Join the counts to the original df.
df = df.join(counts)
The resulting output:
time object obj1_count obj2_count
0 04:10 obj1 1 0
1 04:10 obj1 2 0
2 04:11 obj1 3 0
3 04:12 obj2 3 1
4 04:12 obj2 3 2
5 04:12 obj1 4 2
6 04:13 obj2 4 3
You can omit the rename step if it's acceptable to use count as a prefix instead of a suffix, i.e. 'count_obj1' instead of 'obj1_count'. Simply use the prefix parameter of pd.get_dummies:
counts = pd.get_dummies(df['object'], prefix='count').cumsum()
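One caveat: in newer pandas (2.0 and later) pd.get_dummies returns boolean indicator columns by default; cumsum on those still yields integer counts, but you can pass dtype=int to make the intent explicit. A minimal sketch, reconstructing the question's object column:

```python
import pandas as pd

# Reconstruction of the question's object column.
df = pd.DataFrame({
    'object': ['obj1', 'obj1', 'obj1', 'obj2', 'obj2', 'obj1', 'obj2'],
})

# dtype=int requests integer indicator columns up front.
counts = pd.get_dummies(df['object'], prefix='count', dtype=int).cumsum()
df = df.join(counts)
```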
Here's a way using numpy:
import numpy as np
import pandas as pd

# Unique labels, plus for each row the index of its label in u.
u, iv = np.unique(df.object.values, return_inverse=True)

# Broadcasting iv against the label indices yields a one-hot matrix;
# cumsum(0) turns it into running counts down the rows.
objcount = pd.DataFrame(
    (iv[:, None] == np.arange(len(u))).cumsum(0),
    df.index, u
)

pd.concat([df, objcount], axis=1)