I've got a pandas DataFrame with a boolean column sorted by another column and need to calculate reverse cumulative sum of the boolean column, that is, amount of true values from current row to bottom.
Example
In [13]: df = pd.DataFrame({'A': [True] * 3 + [False] * 5, 'B': np.random.rand(8) }) In [15]: df = df.sort_values('B') In [16]: df Out[16]: A B 6 False 0.037710 2 True 0.315414 4 False 0.332480 7 False 0.445505 3 False 0.580156 1 True 0.741551 5 False 0.796944 0 True 0.817563
I need something that will give me a new column with values
3 3 2 2 2 2 1 1
That is, for each row it should contain amount of True values on this row and rows below.
I've tried various methods using .iloc[::-1]
but result is not that is desired.
It looks like I'm missing some obvious bit of information. I've starting using Pandas only yesterday.
The cumsum() method returns a DataFrame with the cumulative sum for each row. The cumsum() method goes through the values in the DataFrame, from the top, row by row, adding the values with the value from the previous row, ending up with a DataFrame where the last row contains the sum of all values for each column.
Pandas Series: cumsum() function The cumsum() function is used to get cumulative sum over a DataFrame or Series axis. Returns a DataFrame or Series of the same size containing the cumulative sum.
Reverse column A, take the cumsum, then reverse again:
df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]
import pandas as pd df = pd.DataFrame( {'A': [False, True, False, False, False, True, False, True], 'B': [0.03771, 0.315414, 0.33248, 0.445505, 0.580156, 0.741551, 0.796944, 0.817563],}, index=[6, 2, 4, 7, 3, 1, 5, 0]) df['C'] = df.loc[::-1, 'A'].cumsum()[::-1] print(df)
yields
A B C 6 False 0.037710 3 2 True 0.315414 3 4 False 0.332480 2 7 False 0.445505 2 3 False 0.580156 2 1 True 0.741551 2 5 False 0.796944 1 0 True 0.817563 1
Alternatively, you could count the number of True
s in column A
and subtract the (shifted) cumsum:
In [113]: df['A'].sum()-df['A'].shift(1).fillna(0).cumsum() Out[113]: 6 3 2 3 4 2 7 2 3 2 1 2 5 1 0 1 Name: A, dtype: object
But this is significantly slower. Using IPython to perform the benchmark:
In [116]: df = pd.DataFrame({'A':np.random.randint(2, size=10**5).astype(bool)}) In [117]: %timeit df['A'].sum()-df['A'].shift(1).fillna(0).cumsum() 10 loops, best of 3: 19.8 ms per loop In [118]: %timeit df.loc[::-1, 'A'].cumsum()[::-1] 1000 loops, best of 3: 701 µs per loop
Similar to unutbus first suggestion, but without the deprecated ix:
df['C']=df.A[::-1].cumsum()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With