My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, <code>diamonds</code> from <code>ggplot2</code>. I know I could do it like this: <pre class="prettyprint"><code>ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price)) </code></pre> But while it is reasonable for 3 columns, it is unacceptable for 17 of them. When researching this, I found the <code>colwise</code> function, but the best I came up with is this: <pre class="prettyprint"><code>cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2]) </code></pre> Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands): <pre class="prettyprint"><code>ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price)) </code></pre> or: <pre class="prettyprint"><code>ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price)) </code></pre> To sum up: <ul> <li>I don't want to have to type all 17 columns explicitly, like the first example does with <code>x</code>, <code>y</code>, and <code>z</code>.</li> <li>Ideally, I would like to do it with a single call to <code>ddply</code>, without resorting to <code>cbind</code> (or similar functions), as in the second example.</li> </ul> For reference, the result I expect is 5 rows and 5 columns: <pre class="prettyprint"><code> cut x y z price 1 Fair 10057.50 9954.07 6412.26 4358.758 2 Good 28645.08 28703.75 17855.42 3928.864 3 Very Good 69359.09 69713.45 43009.52 3981.760 4 Premium 82385.88 81985.82 50297.49 4584.258 5 Ideal 118691.07 118963.24 73304.61 3457.542 </code></pre>

Antoher solution using <code>dplyr</code>. First you apply both aggregate functions on every variable you want to be aggregated. Of the resulting variables you select only the desired function/variable combination. <pre class="prettyprint"><code>library(dplyr) library(ggplot2) diamonds %>% group_by(cut) %>% summarise_each(funs(sum, mean), x:z, price) %>% select(cut, matches("[xyz]_sum"), price_mean) </code></pre>

Yet another approach (in my opinion easier to read) for your particular case (<code>mean = sum/n</code>!) <pre class="prettyprint"><code>nCut <- ddply(diamonds, ~cut, nrow) res <- ddply(diamonds, ~cut, colwise(sum, 6:9)) res$price <- res$price/nCut$V1 </code></pre> or the more generic, <pre class="prettyprint"><code>do.call(merge, lapply(c(colwise(sum, 7:9), colwise(mean, 6)), function(cw) ddply(diamonds, ~cut, cw))) </code></pre>

Just to throw in another solution: <pre class="prettyprint"><code>library(plyr) library(ggplot2) trans <- list(mean = 8:10, sum = 7) makeList <- function(inL, mdat = diamonds, by = ~cut) { colN <- names(mdat) args <- unlist(llply(names(inL), function(n) { llply(inL[[n]], function(x) { ret <- list(call(n, as.symbol(colN[[x]]))) names(ret) <- paste(n, colN[[x]], sep = ".") ret }) })) args$.data <- as.symbol(deparse(substitute(mdat))) args$.variables <- by args$.fun <- as.symbol("summarise") args } do.call(ddply, makeList(trans)) # cut mean.x mean.y mean.z sum.price # 1 Fair 6.246894 6.182652 3.982770 7017600 # 2 Good 5.838785 5.850744 3.639507 19275009 # 3 Very Good 5.740696 5.770026 3.559801 48107623 # 4 Premium 5.973887 5.944879 3.647124 63221498 # 5 Ideal 5.507451 5.520080 3.401448 74513487 </code></pre> The idea is that the function <code>makeList</code> creates an argument list for <code>ddply</code>. In this way you can quite easily add terms to the list (as <code>function.name = column.indices</code>) and <code>ddply</code> will work as expected: <pre class="prettyprint"><code>trans <- c(trans, sd = list(9:10)) do.call(ddply, makeList(trans)) # cut mean.x mean.y mean.z sum.price sd.y sd.z # 1 Fair 6.246894 6.182652 3.982770 7017600 0.9563804 0.6516384 # 2 Good 5.838785 5.850744 3.639507 19275009 1.0515353 0.6548925 # 3 Very Good 5.740696 5.770026 3.559801 48107623 1.1029236 0.7302281 # 4 Premium 5.973887 5.944879 3.647124 63221498 1.2597511 0.7311610 # 5 Ideal 5.507451 5.520080 3.401448 74513487 1.0744953 0.6576481 </code></pre>

Aggregate sum and mean in R with ddply

Tags:

r

plyr

My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.

I know I could do it like this:

ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))

But while it is reasonable for 3 columns, it is unacceptable for 17 of them.

When researching this, I found the colwise function, but the best I came up with is this:

cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])

Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):

ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))

or:

ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))

To sum up:

I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.

For reference, the result I expect is 5 rows and 5 columns:

        cut         x         y        z    price
1      Fair  10057.50   9954.07  6412.26 4358.758
2      Good  28645.08  28703.75 17855.42 3928.864
3 Very Good  69359.09  69713.45 43009.52 3981.760
4   Premium  82385.88  81985.82 50297.49 4584.258
5     Ideal 118691.07 118963.24 73304.61 3457.542

543

asked Nov 13 '15 22:11

arekolek

4 Answers

I would like to suggest data.table solutions for this. You can easily predefine the columns you want operate either by position or by names and then reuse the same code no matter how many column you want to operate on.

Predifine column names

Sums <- 7:9
Means <- "price"

Run the code

library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
                         lapply(.SD[, Means, with = FALSE], mean))
                     , by = cut]

#          cut         x         y        z    price
# 1:     Ideal 118691.07 118963.24 73304.61 3457.542
# 2:   Premium  82385.88  81985.82 50297.49 4584.258
# 3:      Good  28645.08  28703.75 17855.42 3928.864
# 4: Very Good  69359.09  69713.45 43009.52 3981.760
# 5:      Fair  10057.50   9954.07  6412.26 4358.758

For your specific example, this could simplified to just

data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
#          cut         x         y        z       pe
# 1:     Ideal 118691.07 118963.24 73304.61 3457.542
# 2:   Premium  82385.88  81985.82 50297.49 4584.258
# 3:      Good  28645.08  28703.75 17855.42 3928.864
# 4: Very Good  69359.09  69713.45 43009.52 3981.760
# 5:      Fair  10057.50   9954.07  6412.26 4358.758

172

answered Oct 13 '22 08:10

David Arenburg

Antoher solution using dplyr. First you apply both aggregate functions on every variable you want to be aggregated. Of the resulting variables you select only the desired function/variable combination.

library(dplyr)
library(ggplot2)

diamonds %>%
    group_by(cut) %>%
    summarise_each(funs(sum, mean), x:z, price) %>%
    select(cut, matches("[xyz]_sum"), price_mean)

answered Oct 13 '22 06:10

MarkusN

Yet another approach (in my opinion easier to read) for your particular case (mean = sum/n!)

nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1

or the more generic,

do.call(merge, 
    lapply(c(colwise(sum, 7:9), colwise(mean, 6)), 
           function(cw) ddply(diamonds, ~cut, cw)))

answered Oct 13 '22 06:10

jMathew

Just to throw in another solution:

library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)

makeList <- function(inL, mdat = diamonds, by = ~cut) {
   colN <- names(mdat)
   args <- unlist(llply(names(inL), function(n) {
      llply(inL[[n]], function(x) {
         ret <- list(call(n, as.symbol(colN[[x]])))
         names(ret) <- paste(n, colN[[x]], sep = ".")
         ret
      })
   }))
   args$.data <- as.symbol(deparse(substitute(mdat)))
   args$.variables <- by
   args$.fun <- as.symbol("summarise")
   args
}

do.call(ddply, makeList(trans))
#         cut   mean.x   mean.y   mean.z sum.price
# 1      Fair 6.246894 6.182652 3.982770   7017600
# 2      Good 5.838785 5.850744 3.639507  19275009
# 3 Very Good 5.740696 5.770026 3.559801  48107623
# 4   Premium 5.973887 5.944879 3.647124  63221498
# 5     Ideal 5.507451 5.520080 3.401448  74513487

The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:

trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
#         cut   mean.x   mean.y   mean.z sum.price      sd.y      sd.z
# 1      Fair 6.246894 6.182652 3.982770   7017600 0.9563804 0.6516384
# 2      Good 5.838785 5.850744 3.639507  19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801  48107623 1.1029236 0.7302281
# 4   Premium 5.973887 5.944879 3.647124  63221498 1.2597511 0.7311610
# 5     Ideal 5.507451 5.520080 3.401448  74513487 1.0744953 0.6576481

answered Oct 13 '22 06:10

thothal

Related questions
                            
                                Write a function that returns a vector of ggplot2 graphs
                            
                                plot lines instead of points R
                            
                                generate column values with multiple conditions in R
                            
                                How to get the maximum value by group
                            
                                How to edit 300 GB text file (genomics data)?
                            
                                Histogram of uniform distribution not plotted correctly in R
                            
                                R regex find last occurrence of delimiter
                            
                                How can I automatically create n lags in a timeseries?
                            
                                How to append multiple files in R
                            
                                Sorting a table with R
                            
                                Calculate rolling difference in R
                            
                                How to add multiple columns to a tibble?
                            
                                Aggregating table() over multiple columns in R without a "by" breakdown
                            
                                How to expand a list with NULLs up to some length?
                            
                                R and matrix with 1 row
                            
                                R-Project if else syntax [duplicate]
                            
                                Multiple replacement in R
                            
                                Color in ggplot- continuous value applied to discrete variable
                            
                                R plyr, data.table, apply certain columns of data.frame
                            
                                Replace negative values by NA values

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Aggregate sum and mean in R with ddply

Tags:

r

plyr

arekolek

People also ask

4 Answers

David Arenburg

MarkusN

jMathew

thothal

Recent Activity

Donate For Us