
Using cumsum in pandas on group()

From a pandas newbie: I have data that looks essentially like this:

import pandas as pd

data1 = pd.DataFrame(
    {'Dir': ['E','E','W','W','E','W','W','E'],
     'Bool': ['Y','N','Y','N','Y','N','Y','N'],
     'Data': [4,5,6,7,8,9,10,11]},
    index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001',
                            '1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]: 
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E

And I want to group it by multiple levels, then do a cumsum():

E.g., something like this (which doesn't work):

running_sum = data1.groupby(['Bool','Dir']).cumsum()

with output that would look something like:

Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16

My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.

Thanks for any help you can give.

msteen asked Apr 02 '13 02:04




1 Answer

Try this:

data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"])   # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()
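
For what it's worth, here is a rough sketch of the same two-step idea (sum per date first, then cumsum) written with column names instead of level numbers. The index name "Date", the ISO date strings, and the names daily and running_sum are my own choices for illustration, and I'm assuming a reasonably recent pandas:

```python
import pandas as pd

# Same values as data1 in the question. The ISO date strings and the
# index name "Date" are assumptions made for this sketch.
data1 = pd.DataFrame(
    {
        "Dir": ["E", "E", "W", "W", "E", "W", "W", "E"],
        "Bool": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
        "Data": [4, 5, 6, 7, 8, 9, 10, 11],
    },
    index=pd.DatetimeIndex(
        ["2000-12-30", "2000-12-30", "2000-12-30", "2001-01-02",
         "2001-01-03", "2001-01-03", "2000-12-30", "2000-12-30"],
        name="Date",
    ),
)

# Step 1: collapse duplicate (Bool, Dir, Date) rows into one total each.
daily = data1.reset_index().groupby(["Bool", "Dir", "Date"])["Data"].sum()

# Step 2: running total within each (Bool, Dir) pair, in date order.
running_sum = daily.groupby(level=["Bool", "Dir"]).cumsum().rename("running_sum")

print(running_sum)
```

With the question's data this should reproduce the running totals in the table the asker wanted (16 for N/E, then 7 and 16 for N/W, 4 and 12 for Y/E, and 16 for Y/W).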

The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) produces a DataFrame smaller than the one you started with, because the function collapses each group to a single value. However, cumsum is not an aggregation function: it returns a DataFrame the same size as the one it's called on. So unless your input DataFrame is already in a shape where the output can sensibly be the same size (here, one row per Bool, Dir and date), calling cumsum directly will throw an error or give you one running total per original row rather than one per date. That's why I called sum first: it returns a DataFrame in the correct input format.
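
To make the size difference concrete, here is a small sketch, assuming a recent pandas version (where a grouped cumsum on the raw rows runs, but returns one value per input row). The frame df and the names totals and running are hypothetical; the values are the ones from the question with the dates left out:

```python
import pandas as pd

# Hypothetical frame holding the question's values, without the date index.
df = pd.DataFrame({
    "Bool": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
    "Dir": ["E", "E", "W", "W", "E", "W", "W", "E"],
    "Data": [4, 5, 6, 7, 8, 9, 10, 11],
})

grouped = df.groupby(["Bool", "Dir"])["Data"]

totals = grouped.sum()      # aggregation: one row per (Bool, Dir) group
running = grouped.cumsum()  # transformation: one row per input row

print(len(df), len(totals), len(running))  # 8 4 8
```

So sum shrinks the data to one row per group, while cumsum keeps every original row, which is why duplicate dates have to be summed away before taking the running total.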

Sorry if I haven't explained this well enough. Maybe someone else could help me out?

bdiamante answered Sep 17 '22 20:09