
Reset cumulative sum based on a condition in pandas

I have a data frame like:

customer spend hurdle 
A         20    50      
A         31    50      
A         20    50      
B         50    100     
B         51    100    
B         30    100     

I want to add a Cumulative column holding a running sum of spend per customer that resets once the cumulative sum is greater than or equal to the hurdle, like the following:

customer spend hurdle Cumulative 
A         20    50      20
A         31    50      51
A         20    50      20
B         50    100     50
B         51    100    101
B         30    100     30

I used cumsum and groupby in pandas, but I do not know how to reset the sum based on the condition.

This is the code I am currently using:

df1['cum_sum'] = df1.groupby(['customer'])['spend'].apply(lambda x: x.cumsum())

I know this is just a normal cumulative sum. I would very much appreciate your help.

user2741956 asked Oct 17 '17 07:10





2 Answers

There may be a faster, more efficient way. Here is one inefficient way to do it with apply.

In [3270]: def custcum(x):
      ...:     total = 0
      ...:     for i, v in x.iterrows():
      ...:         total += v.spend
      ...:         x.loc[i, 'cum'] = total
      ...:         if total >= v.hurdle:
      ...:            total = 0
      ...:     return x
      ...:

In [3271]: df.groupby('customer').apply(custcum)
Out[3271]:
  customer  spend  hurdle    cum
0        A     20      50   20.0
1        A     31      50   51.0
2        A     20      50   20.0
3        B     50     100   50.0
4        B     51     100  101.0
5        B     30     100   30.0

You may consider using cython or numba to speed up the custcum
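
As a rough sketch of what such a speed-up might start from, here is the same logic written as a plain loop over NumPy arrays, which is the general shape of function that numba can usually compile. Numba itself is not used here. The name `reset_cumsum` is hypothetical, and the sketch assumes the rows of each customer are contiguous.

```python
import numpy as np
import pandas as pd


def reset_cumsum(customer, spend, hurdle):
    """Running total per customer that restarts after reaching the hurdle."""
    out = np.empty(len(spend), dtype=np.float64)
    total = 0.0
    for i in range(len(spend)):
        # Start over when a new customer begins
        # (assumption: rows are already grouped by customer).
        if i > 0 and customer[i] != customer[i - 1]:
            total = 0.0
        total += spend[i]
        out[i] = total
        # Reset only after recording a total that meets or exceeds the hurdle.
        if total >= hurdle[i]:
            total = 0.0
    return out


df = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B", "B"],
    "spend": [20, 31, 20, 50, 51, 30],
    "hurdle": [50, 50, 50, 100, 100, 100],
})

# Integer codes stand in for the customer labels so the loop sees only numbers.
codes = pd.factorize(df["customer"])[0]
df["cum"] = reset_cumsum(codes, df["spend"].to_numpy(), df["hurdle"].to_numpy())
print(df)
```

On the sample frame this gives the same cum column as custcum above. If the frame is not sorted by customer, sorting it first (for example with a stable sort_values on customer) should satisfy the contiguity assumption.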


[Update]

Improved version of Ido S's answer.

In [3276]: s = df.groupby('customer').spend.cumsum()

In [3277]: np.where(s > df.hurdle.shift(-1), s, df.spend)
Out[3277]: array([ 20,  51,  20,  50, 101,  30], dtype=int64)
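
Note that this np.where expression matches the sample output, but as far as I can tell it does not carry a running total across more than two rows before a reset. A small check with made-up data (spends of 10, 20, 30, 5 against a hurdle of 50, where the desired result would be 10, 30, 60, 5):

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; not taken from the question.
df = pd.DataFrame({
    "customer": ["A", "A", "A", "A"],
    "spend": [10, 20, 30, 5],
    "hurdle": [50, 50, 50, 50],
})

# Plain grouped cumulative sum: 10, 30, 60, 65
s = df.groupby("customer").spend.cumsum()

# Same expression as above: falls back to the raw spend wherever the
# cumulative sum does not exceed the next row's hurdle.
shortcut = np.where(s > df.hurdle.shift(-1), s, df.spend)
print(shortcut)  # the second value is 20, not the running total 30
```

So the shortcut looks safe only when every reset happens within two rows, as in the sample data; otherwise the loop-based approach is the safer choice.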
Zero answered Oct 31 '22 21:10



One way is the code below, although it is a really inefficient and inelegant one-liner.

df1.groupby('customer').apply(lambda x: (x['spend'].cumsum() *(x['spend'].cumsum() > x['hurdle']).astype(int).shift(-1)).fillna(x['spend']))
Ido S answered Oct 31 '22 21:10
