Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas group by cumsum keep columns

I have spent a few hours now trying to do a "cumulative group by sum" on a pandas dataframe. I have looked at all the stackoverflow answers and surprisingly none of them can solve my (very elementary) problem:

I have a dataframe:

df1 Out[8]: Name Date Amount 0 Jack 2016-01-31 10 1 Jack 2016-02-29 5 2 Jack 2016-02-29 8 3 Jill 2016-01-31 10 4 Jill 2016-02-29 5

I am trying to

  1. group by ['Name','Date'] and
  2. cumsum 'Amount'.
  3. That is it.

So the desired output is:

df1 Out[10]: Name Date Cumsum 0 Jack 2016-01-31 10 1 Jack 2016-02-29 23 2 Jill 2016-01-31 10 3 Jill 2016-02-29 15

EDIT: I am simplifying the question. With the current answers I still can't get the correct "running" cumsum. Look closely, I want to see the cumulative sum "10, 23, 10, 15". In words, I want to see, at every consecutive date, the total cumulative sum for a person. NB: If there are two entries on one date for the same person, I want to sum those and then add them to the running cumsum and only then print the sum.

like image 933
gmarais Avatar asked Jan 23 '17 14:01

gmarais


People also ask

Does Groupby maintain order?

Groupby preserves the order of rows within each group.

Can you use Groupby with multiple columns in pandas?

groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.

Does Groupby preserve index?

The Groupby Rolling function does not preserve the original index and so when dates are the same within the Group, it is impossible to know which index value it pertains to from the original dataframe.

What does Cumsum do in pandas?

Pandas DataFrame cumsum() Method The cumsum() method goes through the values in the DataFrame, from the top, row by row, adding the values with the value from the previous row, ending up with a DataFrame where the last row contains the sum of all values for each column.


2 Answers

You need assign output to new column and then remove Amount column by drop:

df1['Cumsum'] = df1.groupby(by=['Name','Date'])['Amount'].cumsum()
df1 = df1.drop('Amount', axis=1)
print (df1)
   Name        Date  Cumsum
0  Jack  2016-01-31      10
1  Jack  2016-02-29       5
2  Jack  2016-02-29      13
3  Jill  2016-01-31      10
4  Jill  2016-02-29       5

Another solution with assign:

df1 = df1.assign(Cumsum=df1.groupby(by=['Name','Date'])['Amount'].cumsum())
         .drop('Amount', axis=1)
print (df1)
   Name        Date  Cumsum
0  Jack  2016-01-31      10
1  Jack  2016-02-29       5
2  Jack  2016-02-29      13
3  Jill  2016-01-31      10
4  Jill  2016-02-29       5

EDIT by comment:

First groupby columns Name and Date and aggregate sum, then groupby by level Name and aggregate cumsum.

df = df1.groupby(by=['Name','Date'])['Amount'].sum()
        .groupby(level='Name').cumsum().reset_index(name='Cumsum')
print (df)
   Name        Date  Cumsum
0  Jack  2016-01-31      10
1  Jack  2016-02-29      23
2  Jill  2016-01-31      10
3  Jill  2016-02-29      15
like image 68
jezrael Avatar answered Sep 27 '22 20:09

jezrael


Set the index first, then groupby.

df.set_index(['Name', 'Date']).groupby(level=[0, 1]).Amount.cumsum().reset_index()

enter image description here


After the OP changed their question, this is now the correct answer.

df1.groupby(
    ['Name','Date']
)Amount.sum().groupby(
    level='Name'
).cumsum()

This is the same answer provided by jezrael

like image 28
piRSquared Avatar answered Sep 27 '22 19:09

piRSquared