Pandas cumulative count [duplicate]

I have a data frame like this:

0        04:10  obj1
1        04:10  obj1
2        04:11  obj1
3        04:12  obj2
4        04:12  obj2
5        04:12  obj1
6        04:13  obj2

I want a cumulative count for each object, like this:

idx      time   object   obj1_count   obj2_count 
0        04:10  obj1        1             0
1        04:10  obj1        2             0
2        04:11  obj1        3             0
3        04:12  obj2        3             1
4        04:12  obj2        3             2
5        04:12  obj1        4             2
6        04:13  obj2        4             3

I tried playing with cumsum, but I'm not sure that is the right approach. Any suggestions?

jincept asked Nov 30 '16 23:11



4 Answers

There is a dedicated method for this kind of operation: cumcount, called on a groupby. It numbers the rows within each group, starting at 0:

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
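
Applied to the data from the question, the same call might look like the sketch below. It assumes the columns are named 'time' and 'object' as in the desired output; the 'running_count' column name is made up for illustration. Because cumcount is zero-based, 1 is added to get a running count. Note that this gives a single column holding the count of each row's own object, not one column per object.

```python
import pandas as pd

# Data from the question (column names assumed from the desired output).
df = pd.DataFrame({
    "time": ["04:10", "04:10", "04:11", "04:12", "04:12", "04:12", "04:13"],
    "object": ["obj1", "obj1", "obj1", "obj2", "obj2", "obj1", "obj2"],
})

# cumcount() numbers rows within each group from 0, so add 1
# to turn it into "how many times this object has appeared so far".
df["running_count"] = df.groupby("object").cumcount() + 1

print(df)
```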
Alex Glinsky answered Oct 24 '22 05:10


You can just compare the column against the value of interest and call cumsum:

In [12]:
df['obj1_count'] = (df['object'] == 'obj1').cumsum()
df['obj2_count'] = (df['object'] == 'obj2').cumsum()
df

Out[12]:
      time object  obj1_count  obj2_count
idx                                      
0    04:10   obj1           1           0
1    04:10   obj1           2           0
2    04:11   obj1           3           0
3    04:12   obj2           3           1
4    04:12   obj2           3           2
5    04:12   obj1           4           2
6    04:13   obj2           4           3

Here the comparison will produce a boolean series:

In [13]:
df['object'] == 'obj1'

Out[13]:
idx
0     True
1     True
2     True
3    False
4    False
5     True
6    False
Name: object, dtype: bool

When you call cumsum on this series, the True values are treated as 1 and the False values as 0, and they are summed cumulatively.
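
To make the True/False to 1/0 step visible, here is a small self-contained sketch. The variable names (objects, mask, as_ints, running) are illustrative only, and the explicit astype(int) is there just to show the conversion; cumsum does not need it.

```python
import pandas as pd

objects = pd.Series(["obj1", "obj1", "obj1", "obj2", "obj2", "obj1", "obj2"])

# Comparing against a value gives a boolean Series.
mask = objects == "obj1"

# Shown only for illustration: True becomes 1 and False becomes 0.
as_ints = mask.astype(int)

# cumsum on the boolean Series adds those 1s and 0s cumulatively.
running = mask.cumsum()

print(as_ints.tolist())   # [1, 1, 1, 0, 0, 1, 0]
print(running.tolist())   # [1, 2, 3, 3, 3, 4, 4]
```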

EdChum answered Oct 24 '22 05:10


You can generalize this process by getting the cumsum of pd.get_dummies. This should work for an arbitrary number of objects you want to count, without needing to specify each one individually:

# Get the cumulative counts.
counts = pd.get_dummies(df['object']).cumsum()

# Rename the count columns as appropriate.
counts = counts.rename(columns=lambda col: col+'_count')

# Join the counts to the original df.
df = df.join(counts)

The resulting output:

    time object  obj1_count  obj2_count
0  04:10   obj1           1           0
1  04:10   obj1           2           0
2  04:11   obj1           3           0
3  04:12   obj2           3           1
4  04:12   obj2           3           2
5  04:12   obj1           4           2
6  04:13   obj2           4           3

You can omit the rename step if it's acceptable to use count as a prefix instead of a suffix, i.e. 'count_obj1' instead of 'obj1_count'. Simply use the prefix parameter of pd.get_dummies:

counts = pd.get_dummies(df['object'], prefix='count').cumsum()
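
For reference, an end-to-end sketch of the get_dummies approach on the question's data follows. Two details are my own assumptions rather than part of the answer: dtype=int is passed so the indicator columns are integers regardless of pandas version (newer versions default to booleans), and add_suffix is used as a shorthand for the rename step.

```python
import pandas as pd

# Data from the question (column names assumed from the desired output).
df = pd.DataFrame({
    "time": ["04:10", "04:10", "04:11", "04:12", "04:12", "04:12", "04:13"],
    "object": ["obj1", "obj1", "obj1", "obj2", "obj2", "obj1", "obj2"],
})

# One indicator column per distinct object, then a running total of each.
counts = pd.get_dummies(df["object"], dtype=int).cumsum()

# Equivalent to the rename(columns=lambda ...) step, just shorter.
counts = counts.add_suffix("_count")

result = df.join(counts)
print(result)
```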
root answered Oct 24 '22 06:10


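Here's a way using NumPy. np.unique with return_inverse=True gives the distinct objects plus an integer code for each row; comparing those codes against np.arange(len(u)) builds a one-hot boolean matrix, and cumsum(0) accumulates it down the rows: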

u, iv = np.unique(
    df.object.values,
    return_inverse=True
)

objcount = pd.DataFrame(
    (iv[:, None] == np.arange(len(u))).cumsum(0),
    df.index, u
)
pd.concat([df, objcount], axis=1)
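
Since the original output was posted as an image, here is a self-contained version of the same idea that can be run directly. The data comes from the question; the variable names (labels, codes, one_hot, result) are my own, and the count columns end up named after the objects themselves ('obj1', 'obj2'), with no '_count' suffix.

```python
import numpy as np
import pandas as pd

# Data from the question (column names assumed from the desired output).
df = pd.DataFrame({
    "time": ["04:10", "04:10", "04:11", "04:12", "04:12", "04:12", "04:13"],
    "object": ["obj1", "obj1", "obj1", "obj2", "obj2", "obj1", "obj2"],
})

# labels: sorted distinct objects; codes: position of each row's object in labels.
labels, codes = np.unique(df["object"].to_numpy(), return_inverse=True)

# Broadcasting (n, 1) against (k,) gives an (n, k) one-hot boolean matrix.
one_hot = codes[:, None] == np.arange(len(labels))

# Cumulative sum down the rows turns the one-hot matrix into running counts.
objcount = pd.DataFrame(one_hot.cumsum(axis=0), index=df.index, columns=labels)

result = pd.concat([df, objcount], axis=1)
print(result)
```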

piRSquared answered Oct 24 '22 05:10