python pandas conditional cumulative sum

Tags:

python-3.x

pandas

dataframe

ipython

Consider my dataframe, df:

data  data_binary  sum_data
  2       1            1
  5       0            0
  1       1            1
  4       1            2
  3       1            3
  10      0            0
  7       0            0
  3       1            1

How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?

The first group of 1s contains a single 1, so sum_data is just 1. The second group contains three 1s, so sum_data is [1, 2, 3].

I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns

array([1, 0, 2, 3, 4, 0, 0, 5])

That is not what I want: the sum keeps growing instead of restarting at each new group of 1s.

asked Jan 02 '17 02:01

GrayHash

2 Answers

You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.

b = df.data_binary
c = b.cumsum()
# subtract the running total as of the most recent zero row
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

Output

0    1
1    0
2    1
3    2
4    3
5    0
6    0
7    1
Name: data_binary, dtype: int64

Explanation

Let's start by looking at each step side by side:

cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
        b, c,
        c.mask(b != 0),
        c.mask(b != 0).ffill(),
        c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    ], axis=1, keys=cols))

Output

   data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
0            1               1           NaN           NaN             1
1            0               1           1.0           1.0             0
2            1               2           NaN           1.0             1
3            1               3           NaN           1.0             2
4            1               4           NaN           1.0             3
5            0               4           4.0           4.0             0
6            0               4           4.0           4.0             0
7            1               5           NaN           4.0             1

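The problem with cumulative_sum is that rows where data_binary is zero do not reset the sum, and that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I mask the cumulative sum so that it is kept only where data_binary is zero (nan_non_zero), then forward fill those values (forward_fill). Each row then carries the running total as of the most recent zero, and subtracting that from the cumulative sum effectively resets the sum. Rows before the first zero have nothing to fill from, which is why fill_value=0 is passed to sub.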
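For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same steps. The helper name cumsum_with_reset is my own (not a pandas function), the frame is rebuilt from the question's sample data, and the sketch assumes the column holds only 0 and 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "data": [2, 5, 1, 4, 3, 10, 7, 3],
    "data_binary": [1, 0, 1, 1, 1, 0, 0, 1],
})


def cumsum_with_reset(b: pd.Series) -> pd.Series:
    """Running count of consecutive 1s that restarts after every 0.

    Hypothetical helper name; assumes `b` contains only 0 and 1.
    """
    c = b.cumsum()
    # Keep the running total only on zero rows, then carry it forward.
    baseline = c.mask(b != 0).ffill()
    # Rows before the first zero have no baseline (NaN); treat it as 0.
    return c.sub(baseline, fill_value=0).astype(int)


result = cumsum_with_reset(df["data_binary"])
print(result.tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```

The printed list matches the sum_data column from the question.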

119

answered Sep 29 '22 16:09

piRSquared

I think you can use groupby with cumsum, grouping by a helper Series. First, compare each value with the previous one (the shifted column) using ne (!=); then take the cumsum of that comparison, which gives every run of consecutive equal values its own group label. Last, use mask to set the result to 0 wherever data_binary is 0:

print(df.data_binary.ne(df.data_binary.shift()).cumsum())
0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: data_binary, dtype: int32

# label each run of consecutive equal values, then cumsum within each run
groups = df.data_binary.ne(df.data_binary.shift()).cumsum()
df['sum_data1'] = df.data_binary.groupby(groups).cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print(df)
   data  data_binary  sum_data  sum_data1
0     2            1         1          1
1     5            0         0          0
2     1            1         1          1
3     4            1         2          2
4     3            1         3          3
5    10            0         0          0
6     7            0         0          0
7     3            1         1          1
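A stripped-down sketch of the grouping idea, using a standalone Series built from the question's data. The variable names run_id and sum_data are just illustrative. The mask step is left out here because a run of zeros already sums to 0.

```python
import pandas as pd

# data_binary column from the question.
b = pd.Series([1, 0, 1, 1, 1, 0, 0, 1], name="data_binary")

# A new run starts wherever the value differs from the previous row,
# so the cumulative sum of those "changed" flags labels each run.
run_id = b.ne(b.shift()).cumsum()

# The cumulative sum restarts inside each run.
sum_data = b.groupby(run_id).cumsum()

print(run_id.tolist())    # [1, 2, 3, 3, 3, 4, 4, 5]
print(sum_data.tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```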

answered Sep 29 '22 16:09

jezrael
