Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group by multiple columns and sum other multiple columns

Tags:

I have a data frame with about 200 columns, out of them I want to group the table by first 10 or so which are factors and sum the rest of the columns.

I have list of all the column names which I want to group by and the list of all the cols which I want to aggregate.

The output format that I am looking for needs to be the same dataframe with same number of cols, just grouped together.

Is there a solution using packages data.table, plyr or any other?

like image 364
user1042267 Avatar asked Nov 21 '11 13:11

user1042267


People also ask

How do you group by and sum multiple columns in pandas?

Use DataFrame. groupby(). sum() to group rows based on one or multiple columns and calculate sum agg function. groupby() function returns a DataFrameGroupBy object which contains an aggregate function sum() to calculate a sum of a given column for each group.

How do I sum multiple columns in pandas?

To sum pandas DataFrame columns (given selected multiple columns) using either sum() , iloc[] , eval() and loc[] functions. Among these pandas DataFrame. sum() function returns the sum of the values for the requested axis, In order to calculate the sum of columns use axis=1 .

Can we use group by on multiple columns in pandas?

groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.


2 Answers

The data.table way is :

DT[, lapply(.SD,sum), by=list(col1,col2,col3,...)] 

or

DT[, lapply(.SD,sum), by=colnames(DT)[1:10]] 

where .SD is the (S)ubset of (D)ata excluding group columns. (Aside: If you need to refer to group columns generically, they are in .BY.)

like image 130
Matt Dowle Avatar answered Nov 10 '22 00:11

Matt Dowle


In base R this would be...

aggregate( as.matrix(df[,11:200]), as.list(df[,1:10]), FUN = sum) 

EDIT: The aggregate function has come a long way since I wrote this. None of the casting above is necessary.

aggregate( df[,11:200], df[,1:10], FUN = sum ) 

And there are a variety of ways to write this. Assuming the first 10 columns are named a1 through a10 I like the following, even though it is verbose.

aggregate(. ~ a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10, data = dat, FUN = sum) 

(You could use paste to construct the formula and use formula)

like image 41
John Avatar answered Nov 09 '22 22:11

John