I have a pandas dataframe with two columns like this:

  Item  Value
0    A      7
1    A      2
2    A     -6
3    A    -70
4    A      8
5    A      0
I want to compute a cumulative sum over the column Value, but whenever the running sum becomes negative I want to reset it to 0.
I am currently using the loop shown below to do this:
sum_ = 0
cumsum = []
for val in sample['Value'].values:
    sum_ += val
    if sum_ < 0:
        sum_ = 0
    cumsum.append(sum_)
print(cumsum) # [7, 9, 3, 0, 8, 8]
I am looking for a more efficient way to perform this in pure pandas.
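A slight modification of your loop. Note that this method is slower than the numba solution.

import numpy as np

# Binary ufunc that adds two values and clamps a negative result to 0
sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
# Use the builtin object (the np.object alias was removed in NumPy 1.24)
newx = sumlm.accumulate(df.Value.values, dtype=object)
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)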
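For the specific case asked about (reset to 0 whenever the running sum drops below 0), the loop can also be written with pandas built-ins only. The sketch below relies on the identity that the clamped sum equals the plain cumulative sum minus its running minimum, with that minimum capped at 0. The column name clipped_cumsum is made up for illustration, and with float data the result may differ from the loop by rounding error.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({"Item": list("AAAAAA"), "Value": [7, 2, -6, -70, 8, 0]})

# Plain cumulative sum, without any reset
running = df["Value"].cumsum()

# Lowest point reached so far, capped at 0: this is the total amount
# that the resets have "forgiven" up to each row
floor = running.cummin().clip(upper=0)

# Subtracting it gives the cumulative sum that never goes below 0
df["clipped_cumsum"] = running - floor  # hypothetical column name

print(df["clipped_cumsum"].tolist())  # [7, 9, 3, 0, 8, 8]
```

This only covers a reset-to-zero threshold of 0. For a general lim as in the numba functions, the explicit loop is still needed.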
numba solution

from numba import njit

@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

cumli(df.Value.values, 0)
Out[166]: [7, 9, 3, 0, 8, 8]
This is only a comment on WeNYoBen's answer.
If you can avoid lists in Numba-compiled code, it is usually advisable to do so.
Example
from numba import njit
import numpy as np

# with lists
@njit()
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

# without lists
@njit()
def cumli_2(x, lim):
    total = 0.
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0.
        result[i] = total
    return result
Timings
Without Numba (comment out @njit()):
x=(np.random.rand(1_000)-0.5)*5
%timeit a=cumli(x, 0.)
220 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a=cumli_2(x, 0.)
227 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Without compilation there is no meaningful difference between using a list and using an array. That is no longer the case once the function is JIT-compiled.
With Numba:
%timeit a=cumli(x, 0.)
27.4 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit a=cumli_2(x, 0.)
2.96 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even in somewhat more complicated cases (final array size unknown, or only the maximum array size known) it often makes sense to allocate an array and shrink it at the end. In simple cases it can even pay off to run the algorithm once just to determine the final array size and then do the real calculation.
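A minimal sketch of the allocate-and-shrink pattern, using a made-up variation of the problem: collecting the positions at which the running sum was reset. The number of resets is not known in advance, but it can never exceed len(x). The function name reset_positions is hypothetical. It is written so that it should also work under @njit(), but it is left undecorated here so that it runs without Numba installed.

```python
import numpy as np

def reset_positions(x, lim):
    """Return the indices at which the running sum fell below lim and was reset."""
    total = 0.0
    # Worst case: a reset at every step, so len(x) slots are always enough
    out = np.empty(x.shape[0], dtype=np.int64)
    n = 0  # number of slots actually used
    for i in range(x.shape[0]):
        total += x[i]
        if total < lim:
            total = 0.0
            out[n] = i
            n += 1
    # Shrink to the used part; copy so the oversized buffer can be freed
    return out[:n].copy()

x = np.array([7.0, 2.0, -6.0, -70.0, 8.0, 0.0])
print(reset_positions(x, 0.0))  # [3]
```

The single oversized allocation replaces repeated list appends, which is the part that was slow in the compiled list version.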