I have a DataFrame that looks like this:
fruit
0 orange
1 orange
2 orange
3 pear
4 orange
5 apple
6 apple
7 pear
8 pear
9 orange
I want to add a column that counts the cumulative occurrences of each value, i.e.
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
At the moment I'm doing it like this:
df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.items()]
... which is fine for 10 rows, but takes a really long time when I'm trying to do the same thing with a few million rows. Is there a more efficient way to do this?
You could use groupby and cumcount:
df['cum_count'] = df.groupby('fruit').cumcount() + 1
In [16]: df
Out[16]:
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
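A self-contained reproduction of this approach (assuming a reasonably recent pandas; `cumcount` counts rows within each group starting from 0, hence the `+ 1`):

```python
import pandas as pd

# Rebuild the example DataFrame from the question
df = pd.DataFrame({"fruit": ["orange", "orange", "orange", "pear", "orange",
                             "apple", "apple", "pear", "pear", "orange"]})

# Number each row within its fruit group, starting at 1
df["cum_count"] = df.groupby("fruit").cumcount() + 1

print(df["cum_count"].tolist())  # [1, 2, 3, 1, 4, 1, 2, 2, 3, 5]
```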
Timing
In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop
In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop
So it's about four times faster even on ten rows, and the gap widens dramatically as the DataFrame grows: the list comprehension rescans the head of the column for every row (quadratic time), while cumcount makes a single vectorized pass per group (roughly linear time).
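A rough sketch of how the gap scales (assumed setup with random data, not the answer's exact benchmark; row count and RNG seed are arbitrary):

```python
import time

import numpy as np
import pandas as pd

# Larger random DataFrame to make the quadratic cost visible
rng = np.random.default_rng(0)
df = pd.DataFrame({"fruit": rng.choice(["apple", "pear", "orange"], size=5_000)})

# Quadratic: re-slices and rescans the column head for every row
t0 = time.perf_counter()
slow = [(df.fruit[0:i + 1] == x).sum() for i, x in df.fruit.items()]
t_slow = time.perf_counter() - t0

# Roughly linear: one vectorized cumulative count per group
t0 = time.perf_counter()
fast = (df.groupby("fruit").cumcount() + 1).tolist()
t_fast = time.perf_counter() - t0

assert slow == fast  # both methods agree
print(f"list comprehension: {t_slow:.3f}s, cumcount: {t_fast:.4f}s")
```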