If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))
    0
0   0
1   2
2   8
3   1
4   0
5   0
6   7
7   0
8   2
9   2
Is there an efficient way to cumsum the rows up to a limit and, each time the limit is reached, start a new cumsum? Each time the limit is reached (after however many rows), a row should be created holding the total cumsum.
Below I have created an example of a function that does this, but it's very slow, especially when the dataframe becomes very large. I don't like that my function is looping and I am looking for a way to make it faster (I guess a way without a loop).
def foo(df, max_value):
    last_value = 0
    storage = []
    for index, row in df.iterrows():
        this_value = np.nansum([row[0], last_value])
        if this_value >= max_value:
            storage.append((index, this_value))
            this_value = 0
        last_value = this_value
    return storage
If you run my function like so: foo(df, 5)
In the above context, it returns:
   0
2  10
6  8
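The loop cannot be avoided, because each running total depends on the previous one, but it can be compiled to machine code using numba's njit:
from numba import njit, prange
@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    # Without parallel=True, prange behaves like a plain range, so this loop runs sequentially.
    for i in prange(len(seq)):
        # The check runs before seq[i] is added, so a total is recorded
        # on the row after the one where it first exceeded max_value.
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i]
    # Whatever is left over is recorded against the last index label.
    cumsum.append([index[-1], running])
    return cumsum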
The index is passed in separately so that the labels are still correct when your index is not the default 0, 1, 2, ... sequence.
%timeit foo(df, 5)
1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If the index is the default integer index (0, 1, ..., n-1), positions and labels coincide, so you can shorten this to:
@njit
def dynamic_cumsum2(seq, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([i, running])
            running = 0
        running += seq[i] 
    cumsum.append([i, running])
    return cumsum
lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
    B
A    
3  10
7   8
9   4
%timeit foo(df, 5)
1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit Functions Performance 
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
    kernels=[
        lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
        lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
    ],
    labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
    n_range=[2**k for k in range(0, 17)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=None # TODO - update when @jpp adds in the final `yield`
)
The log-log plot shows that the generator function, cumsum_limit_nb, is faster for larger inputs.

A possible explanation is that, as N increases, the overhead of appending to a growing list in dynamic_cumsum2 becomes prominent, whereas cumsum_limit_nb just has to yield.
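To make the structural difference concrete, here is a rough pure-Python sketch of the two styles. Both function names are hypothetical, and cumsum_limit_gen is not @jpp's actual cumsum_limit_nb. For simplicity neither version emits the trailing partial sum. It only illustrates list-building versus yielding and says nothing about the numba timings.

```python
def cumsum_limit_list(values, limit):
    """Collect every (position, total) pair in a list before returning."""
    out = []
    running = 0
    for position, value in enumerate(values):
        if running > limit:
            out.append((position, running))  # the list grows with each hit
            running = 0
        running += value
    return out


def cumsum_limit_gen(values, limit):
    """Yield each (position, total) pair as soon as it is known."""
    running = 0
    for position, value in enumerate(values):
        if running > limit:
            yield position, running  # nothing is stored between hits
            running = 0
        running += value


values = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
print(cumsum_limit_list(values, 5))       # [(3, 10), (7, 8)]
print(list(cumsum_limit_gen(values, 5)))  # [(3, 10), (7, 8)]
```

Note that the perfplot kernel wraps the generator in list(...), so a list is still built on the Python side. The difference being measured is where that list gets built.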