
cumsum per group in one column, ordered by a second column, appended to the original dataframe

I have a dataframe with three columns: col1 holds group labels, col2 holds dates (which must be in the correct order), and col3 holds values. I want a cumulative sum of col3 within each unique group in col1, ordered by col2.

I have yet to find an elegant method to accomplish this. The code further down produces the correct result, but it applies cumsum to the whole table and depends heavily on getting all of the sorts right.

multidf = df.groupby(by=['col1','col2']).sum().groupby(level=[0]).cumsum() gives a hierarchical-index solution, but I could not find a way to take the resulting cumulative sum column and attach it to the original dataframe without multistep merge commands. The reset_index command just converts the table back to its original state.

I am looking for a better method to achieve the desired output shown at the end. I have tried lambda, apply, and aggregate, but can't quite get anything to work.

In[229]: df  # example dataframe
Out[229]: 
  col1                 col2  col3
0    a  2013/01/03 00:00:00     1
1    a  2013/03/05 09:43:31     3
2    b  2013/03/07 00:00:00     4
3    b  2013/03/07 00:00:00     2
4    a  2013/03/07 00:00:00     0

import numpy as np
import pandas as pd
##example dataframe
df = pd.DataFrame({'col1' : ['a','a','b','b','a'],
               'col2' : ['2013/01/03 00:00:00', '2013/03/05 09:43:31', '2013/03/07 00:00:00',\
                         '2013/03/07 00:00:00', '2013/03/07 00:00:00'],
               'col3' : [1,3,4,2,0]})
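# DataFrame.sort was removed from pandas; sort_values is its replacement.
# Sort by group, then by date, so the running total follows date order.
df = df.sort_values(['col1','col2'])
# groupby's sort argument is a boolean, so it is dropped here; selecting
# [['col3']] keeps the result a DataFrame and skips the string date column.
jj = df.groupby(by=['col1'])[['col3']].cumsum()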

##multi alternative, can't get result back into original df elegantly
multidf = df.groupby(by=['col1','col2']).sum().groupby(level=[0]).cumsum()

df['cumsum'] = jj['col3']
In[227]: df ## result of the inelegant method; this is the desired output, but how else can I achieve it?
Out[227]: 
  col1                 col2  col3 cumsum
0    a  2013/01/03 00:00:00     1      1
1    a  2013/03/05 09:43:31     3      4
4    a  2013/03/07 00:00:00     0      4
2    b  2013/03/07 00:00:00     4      4
3    b  2013/03/07 00:00:00     2      6
user1620461 asked Sep 30 '22 13:09



1 Answer

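It is easier to set the column you want to sort by as the index and sort on it. Then use groupby('col1')['col3'].transform(pd.Series.cumsum) to compute the running total within each group. transform returns a result with the same index as its input, so it can be assigned directly as a new column. See below: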

In [1]: df
Out[1]: 
  col1                 col2  col3
0    a  2013/01/03 00:00:00     1
1    a  2013/03/05 09:43:31     3
2    b  2013/03/07 00:00:00     4
3    b  2013/03/07 00:00:00     2
4    a  2013/03/07 00:00:00     0

In [2]: df1=df.set_index('col2').sort_index()
In [3]: df1
Out[3]: 
                    col1  col3
col2                          
2013/01/03 00:00:00    a     1
2013/03/05 09:43:31    a     3
2013/03/07 00:00:00    b     4
2013/03/07 00:00:00    b     2
2013/03/07 00:00:00    a     0


In [4]: df1['cumsum']=df1.groupby('col1')['col3'].transform(pd.Series.cumsum)

In [5]: df1
Out[5]: 
                    col1  col3  cumsum
col2                                  
2013/01/03 00:00:00    a     1       1
2013/03/05 09:43:31    a     3       4
2013/03/07 00:00:00    b     4       4
2013/03/07 00:00:00    b     2       6
2013/03/07 00:00:00    a     0       4
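
If you would rather keep the original index and row order, one possible variation is to sort a temporary copy, take the grouped cumulative sum, and let pandas align the result back to df by index. This is only a sketch under a few assumptions: the index is unique (as with the default integer index here), and the date strings are parsed with pd.to_datetime so that ordering is chronological. The conversion step is my addition and is not required by the original example, whose zero-padded strings already sort correctly.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'a'],
    'col2': ['2013/01/03 00:00:00', '2013/03/05 09:43:31', '2013/03/07 00:00:00',
             '2013/03/07 00:00:00', '2013/03/07 00:00:00'],
    'col3': [1, 3, 4, 2, 0],
})

# Assumption: parse the strings so that sorting is by actual date.
df['col2'] = pd.to_datetime(df['col2'], format='%Y/%m/%d %H:%M:%S')

# Sort a temporary copy by group and date, then take the running total
# of col3 within each col1 group.
running = df.sort_values(['col1', 'col2']).groupby('col1')['col3'].cumsum()

# Assignment aligns on the index, so df keeps its original row order.
df['cumsum'] = running

print(df)
```

Because the assignment aligns on index labels, no merge or reset_index step is needed. Rows that share the same group and date (the two 'b' rows here) are accumulated in whatever order the sort leaves them.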
user3576212 answered Oct 03 '22 08:10
