I have a data frame like:
customer spend hurdle
A 20 50
A 31 50
A 20 50
B 50 100
B 51 100
B 30 100
I want to calculate an additional Cumulative column that resets, per customer, whenever the cumulative sum becomes greater than or equal to the hurdle, like the following:
customer spend hurdle Cumulative
A 20 50 20
A 31 50 51
A 20 50 20
B 50 100 50
B 51 100 101
B 30 100 30
I used cumsum and groupby in pandas, but I do not know how to reset the sum based on the condition. The following is the code I am currently using:
df1['cum_sum'] = df1.groupby(['customer'])['spend'].apply(lambda x: x.cumsum())
which I know is just a normal cumulative sum. I would really appreciate your help.
There could be a faster, more efficient way; here is one inefficient apply-based approach:
In [3270]: def custcum(x):
      ...:     total = 0
      ...:     for i, v in x.iterrows():
      ...:         total += v.spend
      ...:         x.loc[i, 'cum'] = total
      ...:         # reset the running total once it meets the hurdle
      ...:         if total >= v.hurdle:
      ...:             total = 0
      ...:     return x
      ...:
In [3271]: df.groupby('customer').apply(custcum)
Out[3271]:
customer spend hurdle cum
0 A 20 50 20.0
1 A 31 50 51.0
2 A 20 50 20.0
3 B 50 100 50.0
4 B 51 100 101.0
5 B 30 100 30.0
You may consider using Cython or Numba to speed up custcum.
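As a sketch of what such a compiled loop could look like (the function name resetting_cumsum is made up here, and the Numba decorator is only suggested in a comment, assuming numba is installed):

```python
import numpy as np
import pandas as pd

def resetting_cumsum(spend, hurdle):
    # Running sum that resets to zero once it meets or exceeds the hurdle.
    # Decorating this with numba.njit would compile the loop to machine code.
    out = np.empty(len(spend), dtype=np.int64)
    total = 0
    for i in range(len(spend)):
        total += spend[i]
        out[i] = total
        if total >= hurdle[i]:
            total = 0
    return out

df = pd.DataFrame({
    'customer': list('AAABBB'),
    'spend':    [20, 31, 20, 50, 51, 30],
    'hurdle':   [50, 50, 50, 100, 100, 100],
})

# Apply per customer; the groups are contiguous here, so concatenating
# the per-group results preserves the original row order.
df['cum'] = np.concatenate([
    resetting_cumsum(g['spend'].to_numpy(), g['hurdle'].to_numpy())
    for _, g in df.groupby('customer', sort=False)
])
# df['cum'] is now [20, 51, 20, 50, 101, 30]
```

Because the loop only touches NumPy arrays (no pandas objects inside), it is the kind of function Numba handles well.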
[Update]
Improved version of Ido's answer:
In [3276]: s = df.groupby('customer').spend.cumsum()
In [3277]: np.where(s > df.hurdle.shift(-1), s, df.spend)
Out[3277]: array([ 20, 51, 20, 50, 101, 30], dtype=int64)
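Run end to end with the question's data (a self-contained sketch; note the trailing NaN produced by shift(-1) makes the comparison False on the last row, so it falls back to spend):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer': list('AAABBB'),
    'spend':    [20, 31, 20, 50, 51, 30],
    'hurdle':   [50, 50, 50, 100, 100, 100],
})

# Per-customer running sum, never reset.
s = df.groupby('customer').spend.cumsum()

# Keep the running sum on the row where it clears the next row's hurdle;
# everywhere else fall back to the row's own spend.
cum = np.where(s > df.hurdle.shift(-1), s, df.spend)
# cum -> array([ 20,  51,  20,  50, 101,  30])
```

This matches the expected output for this data, where each customer's sum crosses the hurdle at most once before resetting.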
One way would be the code below, but it is a really inefficient and inelegant one-liner:
df1.groupby('customer').apply(lambda x: (x['spend'].cumsum() *(x['spend'].cumsum() > x['hurdle']).astype(int).shift(-1)).fillna(x['spend']))