Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python pandas conditional cumulative sum

Consider my dataframe, df:

data  data_binary  sum_data
  2       1            1
  5       0            0
  1       1            1
  4       1            2
  3       1            3
  10      0            0
  7       0            0
  3       1            1

How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?

The first group of 1's had a single 1 and sum_data has only a 1. However, the second group of 1's has 3 1's and sum_data is [1, 2, 3].

I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns

array([1, 0, 2, 3, 4, 0, 0, 5])

Which is not what I want.

like image 950
GrayHash Avatar asked Jan 02 '17 02:01

GrayHash


People also ask

How do you do a cumulative sum in pandas?

Pandas Series: cumsum() function The cumsum() function is used to get cumulative sum over a DataFrame or Series axis. Returns a DataFrame or Series of the same size containing the cumulative sum. The index or the name of the axis. 0 is equivalent to None or 'index'.

How do you find the sum of a column in pandas?

Pandas DataFrame sum() MethodThe sum() method adds all values in each column and returns the sum for each column. By specifying the column axis ( axis='columns' ), the sum() method searches column-wise and returns the sum of each row.

How does cumulative sum work?

A running total, or cumulative sum, is a sequence of partial sums of a given data set. It is used to show the summation of data as it grows with time (updated every time a new number is added to the sequence).

How do you sum a row in Python?

To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.


2 Answers

You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.

b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

Output

0    1
1    0
2    1
3    2
4    3
5    0
6    0
7    1
Name: data_binary, dtype: int64

Explanation

Let's start by looking at each step side by side

cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
        b, c,
        c.mask(b != 0),
        c.mask(b != 0).ffill(),
        c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    ], axis=1, keys=cols))

Output

data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
0            1               1             NaN           NaN             1
1            0               1             1.0           1.0             0
2            1               2             NaN           1.0             1
3            1               3             NaN           1.0             2
4            1               4             NaN           1.0             3
5            0               4             4.0           4.0             0
6            0               4             4.0           4.0             0
7            1               5             NaN           4.0             1

The problem with cumulative_sum is that the rows where data_binary is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.

like image 119
piRSquared Avatar answered Sep 29 '22 16:09

piRSquared


I think you can groupby with DataFrameGroupBy.cumsum by Series, where first compare the next value by the shifted column if not equal (!=) and then create groups by cumsum. Last, replace 0 by column data_binary with mask:

print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: data_binary, dtype: int32

df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                                .cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
   data  data_binary  sum_data  sum_data1
0     2            1         1          1
1     5            0         0          0
2     1            1         1          1
3     4            1         2          2
4     3            1         3          3
5    10            0         0          0
6     7            0         0          0
7     3            1         1          1
like image 34
jezrael Avatar answered Sep 29 '22 16:09

jezrael