Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Counting cumulative occurrences of values in a pandas series

Tags:

python

pandas

I have a DataFrame that looks like this:

    fruit
0  orange
1  orange
2  orange
3    pear
4  orange
5   apple
6   apple
7    pear
8    pear
9  orange

I want to add a column that counts the cumulative occurrences of each value, i.e.

    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5

At the moment I'm doing it like this:

df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]

... which is fine for 10 rows, but takes a really long time when I'm trying to do the same thing with a few million rows. Is there a more efficient way to do this?

like image 628
Li-Wen Yip Avatar asked Feb 18 '16 14:02

Li-Wen Yip


People also ask

How do you count occurrences in Pandas series?

To count the number of occurrences in e.g. a column in a dataframe you can use Pandas value_counts() method. For example, if you type df['condition']. value_counts() you will get the frequency of each unique value in the column “condition”.

How do you count the occurrences of a value in a Pandas DataFrame row?

We can count by using the value_counts() method. This function is used to count the values present in the entire dataframe and also count values in a particular column.

How do you get the cumulative sum of a column in Pandas?

The cumsum() method returns a DataFrame with the cumulative sum for each row. The cumsum() method goes through the values in the DataFrame, from the top, row by row, adding the values with the value from the previous row, ending up with a DataFrame where the last row contains the sum of all values for each column.


1 Answers

You could use groupby and cumcount:

df['cum_count'] = df.groupby('fruit').cumcount() + 1

In [16]: df
Out[16]:
    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5

Timing

In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop

In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop

So it's faster in 4 times.

like image 176
Anton Protopopov Avatar answered Oct 19 '22 01:10

Anton Protopopov