 

Reversed cumulative sum of a column in pandas.DataFrame


I have a pandas DataFrame with a boolean column, sorted by another column, and I need to calculate the reverse cumulative sum of the boolean column, that is, the number of True values from the current row to the bottom.

Example

In [13]: df = pd.DataFrame({'A': [True] * 3 + [False] * 5, 'B': np.random.rand(8)})

In [15]: df = df.sort_values('B')

In [16]: df
Out[16]:
       A         B
6  False  0.037710
2   True  0.315414
4  False  0.332480
7  False  0.445505
3  False  0.580156
1   True  0.741551
5  False  0.796944
0   True  0.817563

I need something that will give me a new column with values

3 3 2 2 2 2 1 1 

That is, for each row it should contain the number of True values in that row and the rows below it.
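To pin down the requirement, here is a deliberately naive reference sketch that counts the True values from each row to the bottom. The Series name `a` and the list name `expected` are made up for illustration, and the quadratic loop is only meant as a specification to check faster solutions against, not as a recommended approach.

```python
import pandas as pd

# Column A in the sorted row order from the example above.
a = pd.Series([False, True, False, False, False, True, False, True])

# For each position i, count the True values from row i down to the last row.
expected = [int(a.iloc[i:].sum()) for i in range(len(a))]
print(expected)  # [3, 3, 2, 2, 2, 2, 1, 1]
```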

I've tried various approaches using .iloc[::-1], but the result is not what I want.

It looks like I'm missing some obvious bit of information. I only started using pandas yesterday.

wl2776 asked Jun 17 '16 03:06




2 Answers

Reverse column A, take the cumsum, then reverse again:

df['C'] = df.loc[::-1, 'A'].cumsum()[::-1] 

import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True],
     'B': [0.03771, 0.315414, 0.33248, 0.445505, 0.580156, 0.741551, 0.796944, 0.817563]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])
df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]
print(df)

yields

       A         B  C
6  False  0.037710  3
2   True  0.315414  3
4  False  0.332480  2
7  False  0.445505  2
3  False  0.580156  2
1   True  0.741551  2
5  False  0.796944  1
0   True  0.817563  1
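To see why the reverse, cumsum, reverse chain produces this column, the intermediate steps can be inspected one at a time. This is only an illustrative sketch: the variable names are invented here, `.iloc[::-1]` is used for the reversal to make the positional slicing explicit, and the booleans are cast to integers up front so the running total is plainly numeric.

```python
import pandas as pd

s = pd.Series([False, True, False, False, False, True, False, True],
              index=[6, 2, 4, 7, 3, 1, 5, 0]).astype(int)

bottom_up = s.iloc[::-1]          # last row first: labels 0, 5, 1, 3, 7, 4, 2, 6
running = bottom_up.cumsum()      # running count of True values, from the bottom up
result = running.iloc[::-1]       # restore the original row order

print(running.tolist())  # [1, 1, 2, 2, 2, 2, 3, 3]
print(result.tolist())   # [3, 3, 2, 2, 2, 2, 1, 1]
```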

Alternatively, you could count the number of Trues in column A and subtract the (shifted) cumsum:

In [113]: df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
Out[113]:
6    3
2    3
4    2
7    2
3    2
1    2
5    1
0    1
Name: A, dtype: object
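The subtraction works because the number of True values from a row down to the bottom equals the total count minus the count strictly above that row. An equivalent way to write the same identity, sketched here with an explicit integer cast (the name `a` is illustrative and not from the answer), is total minus the ordinary cumsum plus the row's own value:

```python
import pandas as pd

a = pd.Series([False, True, False, False, False, True, False, True],
              index=[6, 2, 4, 7, 3, 1, 5, 0]).astype(int)

# cumsum() includes the current row, so add the row's own value back
# after subtracting it from the total.
reverse_cumsum = a.sum() - a.cumsum() + a
print(reverse_cumsum.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
```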

But this is significantly slower, by a factor of roughly 28 on 10**5 rows. Using IPython to perform the benchmark:

In [116]: df = pd.DataFrame({'A':np.random.randint(2, size=10**5).astype(bool)})

In [117]: %timeit df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
10 loops, best of 3: 19.8 ms per loop

In [118]: %timeit df.loc[::-1, 'A'].cumsum()[::-1]
1000 loops, best of 3: 701 µs per loop

unutbu answered Sep 21 '22 09:09
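Outside IPython, a similar comparison can be sketched with the standard `timeit` module. The function names, the seed, and the repeat count below are arbitrary choices for illustration, and the second function uses `shift(1, fill_value=False)` instead of `shift(1).fillna(0)` so the column stays boolean, which means it is a variation on the expression timed above and may not reproduce the same ratio. Timings depend on the machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
df = pd.DataFrame({'A': rng.integers(2, size=10**5).astype(bool)})

def reversed_cumsum():
    # Reverse, accumulate, reverse back.
    return df['A'].iloc[::-1].cumsum().iloc[::-1]

def total_minus_shifted():
    # Total count minus the number of True values strictly above each row.
    # fill_value=False keeps the boolean dtype (a variation on fillna(0)).
    return df['A'].sum() - df['A'].shift(1, fill_value=False).cumsum()

for func in (reversed_cumsum, total_minus_shifted):
    seconds = timeit.timeit(func, number=20) / 20
    print(f"{func.__name__}: {seconds * 1e3:.3f} ms per call")
```

Both functions should return the same values, which is worth asserting before trusting any timing comparison.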


Similar to unutbu's first suggestion, but without the deprecated ix:

df['C']=df.A[::-1].cumsum() 
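The second reversal can be dropped here because assigning a Series to a DataFrame column aligns on index labels rather than on position. A small sketch of that behaviour, using the index from the question and assuming the index labels are unique (the name `bottom_up` is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [False, True, False, False, False, True, False, True]},
                  index=[6, 2, 4, 7, 3, 1, 5, 0])

# The reversed cumulative sum is still in bottom-to-top row order here.
bottom_up = df['A'].iloc[::-1].cumsum()
print(bottom_up.index.tolist())  # [0, 5, 1, 3, 7, 4, 2, 6]

# Column assignment matches values to rows by label, not by position.
df['C'] = bottom_up
print(df['C'].tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
```

If the result is needed as a standalone Series in the original row order rather than as a column, the explicit second reversal is still required.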
Ichta answered Sep 23 '22 09:09