I have a DataFrame that looks like this:
fruit
0 orange
1 orange
2 orange
3 pear
4 orange
5 apple
6 apple
7 pear
8 pear
9 orange
I want to add a column that counts the cumulative occurrences of each value, i.e.
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
At the moment I'm doing it like this:
df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.items()]
... which is fine for 10 rows, but takes a really long time when I'm trying to do the same thing with a few million rows. Is there a more efficient way to do this?
You could use groupby and cumcount:
df['cum_count'] = df.groupby('fruit').cumcount() + 1
In [16]: df
Out[16]:
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
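A self-contained reproduction of this approach (assuming a reasonably recent pandas; `cumcount` counts rows within each group starting from 0, hence the `+ 1`):

```python
import pandas as pd

# Rebuild the example DataFrame from the question
df = pd.DataFrame({"fruit": ["orange", "orange", "orange", "pear", "orange",
                             "apple", "apple", "pear", "pear", "orange"]})

# Number each row within its fruit group, starting at 1
df["cum_count"] = df.groupby("fruit").cumcount() + 1

print(df["cum_count"].tolist())  # [1, 2, 3, 1, 4, 1, 2, 2, 3, 5]
```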
Timing
In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop
In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop
So it's about four times faster even on ten rows, and the gap widens dramatically as the DataFrame grows: the list comprehension rescans the head of the column for every row (quadratic time), while cumcount makes a single vectorized pass per group (roughly linear time).
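A rough sketch of how the gap scales (assumed setup with random data, not the answer's exact benchmark; row count and RNG seed are arbitrary):

```python
import time

import numpy as np
import pandas as pd

# Larger random DataFrame to make the quadratic cost visible
rng = np.random.default_rng(0)
df = pd.DataFrame({"fruit": rng.choice(["apple", "pear", "orange"], size=5_000)})

# Quadratic: re-slices and rescans the column head for every row
t0 = time.perf_counter()
slow = [(df.fruit[0:i + 1] == x).sum() for i, x in df.fruit.items()]
t_slow = time.perf_counter() - t0

# Roughly linear: one vectorized cumulative count per group
t0 = time.perf_counter()
fast = (df.groupby("fruit").cumcount() + 1).tolist()
t_fast = time.perf_counter() - t0

assert slow == fast  # both methods agree
print(f"list comprehension: {t_slow:.3f}s, cumcount: {t_fast:.4f}s")
```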