
Pandas - dataframe groupby - how to get sum of multiple columns

This should be an easy one, but somehow I couldn't find a solution that works.

I have a pandas dataframe which looks like this:

index col1   col2   col3   col4   col5
0     a      c      1      2      f 
1     a      c      1      2      f
2     a      d      1      2      f
3     b      d      1      2      g
4     b      e      1      2      g
5     b      e      1      2      g

I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped, since its data cannot be aggregated.

Here is what the output should look like. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter if col1 and col2 are part of the index or not.

index col1   col2   col3   col4   
0     a      c      2      4          
1     a      d      1      2      
2     b      d      1      2      
3     b      e      2      4      
  

Here is what I tried:

df_new = df.groupby(['col1', 'col2'])['col3', 'col4'].sum()

That, however, only returns the aggregated results of col4.

I am lost here. Every example I found only aggregates one column, where the issue obviously doesn't occur.

Axel asked Sep 26 '17 16:09




4 Answers

By using apply

df.groupby(['col1', 'col2'])[['col3', 'col4']].apply(lambda x: x.astype(int).sum())
Out[1257]:
           col3  col4
col1 col2
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4

If you want to use agg:

df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}) 
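For anyone who wants to try the agg version end to end, here is a minimal sketch. The DataFrame is reconstructed from the table in the question, and the variable name result is just an illustrative choice.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

# Sum col3 and col4 within each (col1, col2) group. col5 is not listed
# in the dict, so it does not appear in the result.
result = df.groupby(["col1", "col2"]).agg({"col3": "sum", "col4": "sum"})
print(result)
```

The group keys end up in a MultiIndex here; call reset_index() on the result if you would rather have col1 and col2 as ordinary columns.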
BENY answered Sep 18 '22 00:09


Another generic solution is

df.groupby(['col1','col2']).agg({'col3':'sum','col4':'sum'}).reset_index() 

This will give you the required output.

UPDATED (June 2020): pandas 0.25.0 introduced "named aggregation" for groupby. You pass keyword arguments whose values are (column, aggregation function) tuples, and each keyword becomes the name of an output column. This is handy when applying multiple aggregation functions to specific columns.

df.groupby(['col1','col2']).agg(
     sum_col3 = ('col3','sum'),
     sum_col4 = ('col4','sum'),
     ).reset_index()

Also, you can name new columns, e.g. I've used 'sum_col3' and 'sum_col4', but you can use any name you want.
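A runnable sketch of named aggregation on the question's data, assuming pandas 0.25.0 or later. The DataFrame is reconstructed from the question, and the variable name named is only for illustration.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

# Each keyword is an output column name; each value is a
# (source column, aggregation function) tuple.
named = (
    df.groupby(["col1", "col2"])
      .agg(sum_col3=("col3", "sum"), sum_col4=("col4", "sum"))
      .reset_index()  # turn the group keys back into ordinary columns
)
print(named)
```

After reset_index() the result has the flat columns col1, col2, sum_col3 and sum_col4, which matches the layout asked for in the question apart from the renamed sum columns.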

Refer to Link for detailed description.

Prateek Sharma answered Sep 20 '22 00:09


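Selecting several columns from a groupby with bare keys, as in ['col3', 'col4'] written directly after groupby(...), triggers the pandas "FutureWarning: Indexing with multiple keys" (discussed on GitHub and Stack Overflow). I recommend passing a list of columns instead, i.e. double brackets: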

df.groupby(['col1', 'col2'])[['col3', 'col4']].sum().reset_index()

Output:

[image: output dataframe]
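Since the output above was posted as an image, here is a small sketch that reproduces it on data reconstructed from the question. The variable name out is illustrative. As far as I know, pandas 2.0 and later reject the bare-keys form with an error instead of a warning, so the list form is the one to rely on.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

# The inner list ['col3', 'col4'] selects both columns from the groupby
# object before summing, so col5 is never aggregated.
out = df.groupby(["col1", "col2"])[["col3", "col4"]].sum().reset_index()
print(out)
```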

oil_lamp answered Sep 18 '22 00:09


The above answer didn't work for me:

df_new = df.groupby(['col1', 'col2']).sum()[["col3", "col4"]]

I was grouping by a single column and summing columns. Here is the one that worked for me:

D1.groupby(['col1'])['col2'].sum()  # the sum goes at the end, not in the middle
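To compare the two orderings side by side, here is a rough sketch on data reconstructed from the question. The numeric_only=True argument is my addition, so that the text column col5 is skipped when summing before selecting; the variable names are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

grouped = df.groupby(["col1", "col2"])

# Sum in the middle: aggregate every numeric column, then pick two.
sum_then_select = grouped.sum(numeric_only=True)[["col3", "col4"]]

# Sum at the end: pick the two columns first, then aggregate only those.
select_then_sum = grouped[["col3", "col4"]].sum()

print(sum_then_select)
print(select_then_sum)
```

On this data both orderings produce the same numbers. Selecting first simply avoids aggregating columns you are going to discard anyway.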
Leo James answered Sep 18 '22 00:09