Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregate sum and mean in R with ddply

Tags:

r

plyr

My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.

I know I could do it like this:

ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))

But while it is reasonable for 3 columns, it is unacceptable for 17 of them.

When researching this, I found the colwise function, but the best I came up with is this:

cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])

Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):

ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))

or:

ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))

To sum up:

  • I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
  • Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.

For reference, the result I expect is 5 rows and 5 columns:

        cut         x         y        z    price
1      Fair  10057.50   9954.07  6412.26 4358.758
2      Good  28645.08  28703.75 17855.42 3928.864
3 Very Good  69359.09  69713.45 43009.52 3981.760
4   Premium  82385.88  81985.82 50297.49 4584.258
5     Ideal 118691.07 118963.24 73304.61 3457.542
like image 543
arekolek Avatar asked Nov 13 '15 22:11

arekolek


People also ask

What does Ddply do in R?

Today we will emphasize ddply() which accepts a data. frame, splits it into pieces based on one or more factors, computes on the pieces, then returns the results as a data. frame. For the record, the built-in functions most relevant to ddply() are tapply() and friends.

How do I sum by group in R?

Now we can use the group_by and the summarise_at functions to get the summation by group: iris %>% # Specify data frame group_by(Species) %>% # Specify group indicator summarise_at(vars(Sepal. Length), # Specify column list(name = sum)) # Specify function # A tibble: 3 x 2 # Species name # <fct> <dbl> # 1 setosa 250.

How do I sum a column by year in R?

To find the total by year column in an R data frame, we can use aggregate function with sum function.


4 Answers

I would like to suggest data.table solutions for this. You can easily predefine the columns you want operate either by position or by names and then reuse the same code no matter how many column you want to operate on.

Predifine column names

Sums <- 7:9
Means <- "price"

Run the code

library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
                         lapply(.SD[, Means, with = FALSE], mean))
                     , by = cut]

#          cut         x         y        z    price
# 1:     Ideal 118691.07 118963.24 73304.61 3457.542
# 2:   Premium  82385.88  81985.82 50297.49 4584.258
# 3:      Good  28645.08  28703.75 17855.42 3928.864
# 4: Very Good  69359.09  69713.45 43009.52 3981.760
# 5:      Fair  10057.50   9954.07  6412.26 4358.758

For your specific example, this could simplified to just

data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
#          cut         x         y        z       pe
# 1:     Ideal 118691.07 118963.24 73304.61 3457.542
# 2:   Premium  82385.88  81985.82 50297.49 4584.258
# 3:      Good  28645.08  28703.75 17855.42 3928.864
# 4: Very Good  69359.09  69713.45 43009.52 3981.760
# 5:      Fair  10057.50   9954.07  6412.26 4358.758
like image 172
David Arenburg Avatar answered Oct 13 '22 08:10

David Arenburg


Antoher solution using dplyr. First you apply both aggregate functions on every variable you want to be aggregated. Of the resulting variables you select only the desired function/variable combination.

library(dplyr)
library(ggplot2)

diamonds %>%
    group_by(cut) %>%
    summarise_each(funs(sum, mean), x:z, price) %>%
    select(cut, matches("[xyz]_sum"), price_mean)
like image 42
MarkusN Avatar answered Oct 13 '22 06:10

MarkusN


Yet another approach (in my opinion easier to read) for your particular case (mean = sum/n!)

nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1

or the more generic,

do.call(merge, 
    lapply(c(colwise(sum, 7:9), colwise(mean, 6)), 
           function(cw) ddply(diamonds, ~cut, cw)))
like image 26
jMathew Avatar answered Oct 13 '22 06:10

jMathew


Just to throw in another solution:

library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)

makeList <- function(inL, mdat = diamonds, by = ~cut) {
   colN <- names(mdat)
   args <- unlist(llply(names(inL), function(n) {
      llply(inL[[n]], function(x) {
         ret <- list(call(n, as.symbol(colN[[x]])))
         names(ret) <- paste(n, colN[[x]], sep = ".")
         ret
      })
   }))
   args$.data <- as.symbol(deparse(substitute(mdat)))
   args$.variables <- by
   args$.fun <- as.symbol("summarise")
   args
}

do.call(ddply, makeList(trans))
#         cut   mean.x   mean.y   mean.z sum.price
# 1      Fair 6.246894 6.182652 3.982770   7017600
# 2      Good 5.838785 5.850744 3.639507  19275009
# 3 Very Good 5.740696 5.770026 3.559801  48107623
# 4   Premium 5.973887 5.944879 3.647124  63221498
# 5     Ideal 5.507451 5.520080 3.401448  74513487

The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:

trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
#         cut   mean.x   mean.y   mean.z sum.price      sd.y      sd.z
# 1      Fair 6.246894 6.182652 3.982770   7017600 0.9563804 0.6516384
# 2      Good 5.838785 5.850744 3.639507  19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801  48107623 1.1029236 0.7302281
# 4   Premium 5.973887 5.944879 3.647124  63221498 1.2597511 0.7311610
# 5     Ideal 5.507451 5.520080 3.401448  74513487 1.0744953 0.6576481
like image 2
thothal Avatar answered Oct 13 '22 06:10

thothal