Consider my dataframe, df
:
data data_binary sum_data
2 1 1
5 0 0
1 1 1
4 1 2
3 1 3
10 0 0
7 0 0
3 1 1
How can I calculate the cumulative sum of data_binary
within groups of contiguous 1
values?
The first group of 1
's had a single 1
and sum_data
has only a 1
. However, the second group of 1
's has 3 1
's and sum_data
is [1, 2, 3]
.
I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0)
, but that returns
array([1, 0, 2, 3, 4, 0, 0, 5])
Which is not what I want.
Pandas Series: cumsum() function The cumsum() function is used to get cumulative sum over a DataFrame or Series axis. Returns a DataFrame or Series of the same size containing the cumulative sum. The index or the name of the axis. 0 is equivalent to None or 'index'.
Pandas DataFrame sum() MethodThe sum() method adds all values in each column and returns the sum for each column. By specifying the column axis ( axis='columns' ), the sum() method searches column-wise and returns the sum of each row.
A running total, or cumulative sum, is a sequence of partial sums of a given data set. It is used to show the summation of data as it grows with time (updated every time a new number is added to the sequence).
To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.
You want to take the cumulative sum of data_binary
and subtract the most recent cumulative sum where data_binary
was zero.
b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
data_binary cumulative_sum nan_non_zero forward_fill final_result
0 1 1 NaN NaN 1
1 0 1 1.0 1.0 0
2 1 2 NaN 1.0 1
3 1 3 NaN 1.0 2
4 1 4 NaN 1.0 3
5 0 4 4.0 4.0 0
6 0 4 4.0 4.0 0
7 1 5 NaN 4.0 1
The problem with cumulative_sum
is that the rows where data_binary
is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary
is zero? Easy! I slice the cumulative sum where data_binary
is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.
I think you can groupby
with DataFrameGroupBy.cumsum
by Series
, where first compare the next value by the shift
ed column if not equal (!=
) and then create groups by cumsum
. Last, replace 0
by column data_binary
with mask
:
print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: data_binary, dtype: int32
df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
.cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
data data_binary sum_data sum_data1
0 2 1 1 1
1 5 0 0 0
2 1 1 1 1
3 4 1 2 2
4 3 1 3 3
5 10 0 0 0
6 7 0 0 0
7 3 1 1 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With