Using cumsum in pandas on groupby()

Q: What is possible using the groupby() method of pandas?

groupby() can accept several different kinds of grouping key: a column name or list of column names; a dict or pandas Series; a NumPy array or pandas Index; or an array-like iterable of these.
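As a rough illustration of those key types, here is a minimal sketch. The frame `df` and its column names are made up for the example and are not part of the question below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Group by a column name
by_column = df.groupby("key")["val"].sum()

# Group by a NumPy array of the same length (matched by position)
by_array = df.groupby(np.array([0, 0, 1, 1]))["val"].sum()

# Group by a dict that maps index labels to group names
by_dict = df.groupby({0: "x", 1: "x", 2: "y", 3: "y"})["val"].sum()

print(by_column, by_array, by_dict, sep="\n\n")
```

All three calls produce one row per group; only the way the group labels are supplied differs.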

Q: What does cumsum do in pandas?

The DataFrame cumsum() method walks down each column from the top, row by row, adding each value to the running total of the rows above it. The result is a DataFrame of the same shape as the input, whose last row contains the sum of all values in each column.
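A small sketch of that behaviour, using a made-up two-column frame (the names `df`, `a` and `b` are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Column-wise running total; the result has the same shape as df
running = df.cumsum()
print(running)

# The last row of the running total matches the plain column sums
last_row = running.iloc[-1]
totals = df.sum()
print((last_row == totals).all())
```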

Q: Can you use Groupby with multiple columns in pandas?

groupby() can take a list of column names to group by several columns at once, and the resulting groups can then be passed to one or more aggregate functions in a single call.
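For instance, a minimal sketch with a made-up frame (the column names echo the question below, but the values are invented for the example):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "Dir": ["E", "E", "W", "W"],
    "Bool": ["Y", "N", "Y", "Y"],
    "Data": [4, 5, 6, 7],
})

# Group by two columns and apply two aggregations at once;
# the result has one row per (Bool, Dir) pair
summary = df.groupby(["Bool", "Dir"])["Data"].agg(["sum", "mean"])
print(summary)
```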

Tags:

python

pandas

group-by

From a Pandas newbie: I have data that looks essentially like this -

import pandas as pd

data1 = pd.DataFrame(
    {'Dir': ['E','E','W','W','E','W','W','E'],
     'Bool': ['Y','N','Y','N','Y','N','Y','N'],
     'Data': [4,5,6,7,8,9,10,11]},
    index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001',
                            '1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]: 
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E

And I want to group it by multiple levels, then do a cumsum():

E.g., like running_sum=data1.groupby(['Bool','Dir']).cumsum() <-(Doesn't work)

with output that would look something like:

Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16

My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.

Thanks for any help you can give.

357

asked Apr 02 '13 02:04

msteen

1 Answer

Try this:

data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"])   # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()

The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) produces a DataFrame smaller than the one you started with, because the function collapses each group to a single value per group key. However, cumsum is not an aggregation function. It returns a DataFrame that is the same size as the one it is called on. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first: it collapses rows that share the same Bool, Dir and date into one row each, which gives a DataFrame in the correct input format for the cumsum within each (Bool, Dir) group.
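The two steps can also be sketched with column names instead of level numbers. This is only an illustrative variant of the approach above: it rebuilds data1 from the question, and the index name "Date" plus the intermediate name `daily` are assumptions added for readability.

```python
import pandas as pd

data1 = pd.DataFrame(
    {"Dir": list("EEWWEWWE"),
     "Bool": list("YNYNYNYN"),
     "Data": [4, 5, 6, 7, 8, 9, 10, 11]},
    index=pd.DatetimeIndex(["12/30/2000", "12/30/2000", "12/30/2000", "1/2/2001",
                            "1/3/2001", "1/3/2001", "12/30/2000", "12/30/2000"]),
)

# Step 1 (aggregation): one row per (Bool, Dir, Date), so duplicates are summed
daily = (data1.rename_axis("Date").reset_index()
              .groupby(["Bool", "Dir", "Date"])["Data"].sum())

# Step 2 (not an aggregation): running total within each (Bool, Dir) pair;
# the output keeps the same rows as `daily`
running_sum = daily.groupby(level=["Bool", "Dir"]).cumsum()
print(running_sum)
```

With the sample data this yields the six running totals listed in the question (16, 7, 16, 4, 12, 16).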

Sorry if I haven't explained this well enough. Maybe someone else could help me out?

105

answered Sep 17 '22 20:09

bdiamante
