My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds
from ggplot2
.
I know I could do it like this:
ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))
But while it is reasonable for 3 columns, it is unacceptable for 17 of them.
When researching this, I found the colwise
function, but the best I came up with is this:
cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])
Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):
ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))
or:
ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))
To sum up:
x
, y
, and z
.ddply
, without resorting to cbind
(or similar functions), as in the second example.For reference, the result I expect is 5 rows and 5 columns:
cut x y z price
1 Fair 10057.50 9954.07 6412.26 4358.758
2 Good 28645.08 28703.75 17855.42 3928.864
3 Very Good 69359.09 69713.45 43009.52 3981.760
4 Premium 82385.88 81985.82 50297.49 4584.258
5 Ideal 118691.07 118963.24 73304.61 3457.542
Today we will emphasize ddply() which accepts a data. frame, splits it into pieces based on one or more factors, computes on the pieces, then returns the results as a data. frame. For the record, the built-in functions most relevant to ddply() are tapply() and friends.
Now we can use the group_by and the summarise_at functions to get the summation by group: iris %>% # Specify data frame group_by(Species) %>% # Specify group indicator summarise_at(vars(Sepal. Length), # Specify column list(name = sum)) # Specify function # A tibble: 3 x 2 # Species name # <fct> <dbl> # 1 setosa 250.
To find the total by year column in an R data frame, we can use aggregate function with sum function.
I would like to suggest data.table
solutions for this. You can easily predefine the columns you want operate either by position or by names and then reuse the same code no matter how many column you want to operate on.
Predifine column names
Sums <- 7:9
Means <- "price"
Run the code
library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
lapply(.SD[, Means, with = FALSE], mean))
, by = cut]
# cut x y z price
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
For your specific example, this could simplified to just
data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
# cut x y z pe
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
Antoher solution using dplyr
. First you apply both aggregate functions on every variable you want to be aggregated. Of the resulting variables you select only the desired function/variable combination.
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise_each(funs(sum, mean), x:z, price) %>%
select(cut, matches("[xyz]_sum"), price_mean)
Yet another approach (in my opinion easier to read) for your particular case (mean = sum/n
!)
nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1
or the more generic,
do.call(merge,
lapply(c(colwise(sum, 7:9), colwise(mean, 6)),
function(cw) ddply(diamonds, ~cut, cw)))
Just to throw in another solution:
library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)
makeList <- function(inL, mdat = diamonds, by = ~cut) {
colN <- names(mdat)
args <- unlist(llply(names(inL), function(n) {
llply(inL[[n]], function(x) {
ret <- list(call(n, as.symbol(colN[[x]])))
names(ret) <- paste(n, colN[[x]], sep = ".")
ret
})
}))
args$.data <- as.symbol(deparse(substitute(mdat)))
args$.variables <- by
args$.fun <- as.symbol("summarise")
args
}
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price
# 1 Fair 6.246894 6.182652 3.982770 7017600
# 2 Good 5.838785 5.850744 3.639507 19275009
# 3 Very Good 5.740696 5.770026 3.559801 48107623
# 4 Premium 5.973887 5.944879 3.647124 63221498
# 5 Ideal 5.507451 5.520080 3.401448 74513487
The idea is that the function makeList
creates an argument list for ddply
. In this way you can quite easily add terms to the list (as function.name = column.indices
) and ddply
will work as expected:
trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price sd.y sd.z
# 1 Fair 6.246894 6.182652 3.982770 7017600 0.9563804 0.6516384
# 2 Good 5.838785 5.850744 3.639507 19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801 48107623 1.1029236 0.7302281
# 4 Premium 5.973887 5.944879 3.647124 63221498 1.2597511 0.7311610
# 5 Ideal 5.507451 5.520080 3.401448 74513487 1.0744953 0.6576481
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With