
Apply several summary functions on several variables by group in one call

Tags: r, r-faq, aggregate

I have the following data frame

x <- read.table(text = "  id1 id2 val1 val2
1   a   x    1    9
2   a   x    2    4
3   a   y    3    5
4   a   y    4    9
5   b   x    1    7
6   b   y    4    4
7   b   x    3    9
8   b   y    2    8", header = TRUE)

I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:

# calculate mean
aggregate(. ~ id1 + id2, data = x, FUN = mean)

# count rows
aggregate(. ~ id1 + id2, data = x, FUN = length)

In order to do both calculations in one call, I tried

do.call("rbind", aggregate(. ~ id1 + id2, data = x,
                           FUN = function(x) data.frame(m = mean(x), n = length(x))))

However, I get a garbled output along with a warning:

#     m   n
# id1 1   2
# id2 1   1
#     1.5 2
#     2   2
#     3.5 2
#     3   2
#     6.5 2
#     8   2
#     7   2
#     6   2
# Warning message:
#   In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
#   number of columns of result is not a multiple of vector length (arg 1)

I could use the plyr package, but my data set is quite large, and plyr becomes very slow (almost unusable) as the data set grows.

How can I use aggregate or other functions to perform several calculations in one call?

Asked by broccoli on Aug 21 '12 at 22:08



1 Answer

You can do it all in one step and get proper labeling:

> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
#   id1 id2 val1.mn val1.n val2.mn val2.n
# 1   a   x     1.5    2.0     6.5    2.0
# 2   b   x     2.0    2.0     8.0    2.0
# 3   a   y     3.5    2.0     7.0    2.0
# 4   b   y     3.0    2.0     6.0    2.0

This creates a dataframe with two id columns and two matrix columns:

str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame':   4 obs. of  4 variables:
 $ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
 $ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
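Because val1 and val2 are matrices stored inside the data frame, the individual statistics are reached by indexing the matrix rather than by a name like val1.mn. A minimal sketch of that, assuming the same data as in the question (rebuilt here with data.frame() so the block runs on its own; the names res, val1_means and val2_counts are just illustrative):

```r
# Same values as the question's data, rebuilt so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

res <- aggregate(. ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

# val1 and val2 are each a numeric matrix with columns "mn" and "n"
is.matrix(res$val1)
colnames(res$val1)

# Pick one statistic by its column name inside the matrix
val1_means  <- res$val1[, "mn"]   # group means of val1
val2_counts <- res$val2[, "n"]    # group sizes

# The printed header "val1.mn" is not an actual column name at this stage
"val1.mn" %in% names(res)
```

This is why code that expects a column called val1.mn fails on the raw aggregate() result even though the printed output appears to show one.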

As pointed out by @lord.garbage below, this can be converted to a data frame with "simple" columns by using do.call(data.frame, ...):

str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) )
'data.frame':   4 obs. of  6 variables:
 $ id1    : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2    : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1.mn: num  1.5 2 3.5 3
 $ val1.n : num  2 2 2 2
 $ val2.mn: num  6.5 8 7 6
 $ val2.n : num  2 2 2 2

To name several variables explicitly on the left-hand side (LHS) of the formula, wrap them in cbind():

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) 
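The cbind() form is mostly useful when only some of the numeric columns should be summarised. A rough sketch under that assumption: the data frame x2 below is the question's data plus a hypothetical extra column val3 (not part of the original example), and only val1 and val3 are aggregated before flattening with do.call(data.frame, ...):

```r
# Question's data plus a hypothetical column val3 (an assumption for illustration)
x2 <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8),
  val3 = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Summarise only val1 and val3; val2 is left out of the LHS on purpose
res <- aggregate(cbind(val1, val3) ~ id1 + id2, data = x2,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

# Flatten the matrix columns into ordinary columns
flat <- do.call(data.frame, res)
names(flat)
```

With the dot notation (. ~ id1 + id2) every non-grouping column would be summarised, so listing the variables in cbind() is the simpler way to restrict the output.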
Answered by IRTFM on Sep 26 '22 at 03:09