I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:
chromosome probeset ensg symbol XXA_00 XXA_36 XXB_00
1 X 4938842 ENSMUSG00000000003 Pbsn 4.796123 4.737717 5.326664
I want to compute the mean of each numeric column over the rows that share an ensg value. The catch is that I would like to leave the other identity variables, chromosome and symbol, untouched, since they are also constant within an ensg.
In the end I would like to have a data.frame with the identity columns chromosome, ensg and symbol, plus the means of the numeric columns over rows with the same identifier. I implemented this with ddply, but it is very slow compared to aggregate:
spec.mean <- function(eset.piece) {
  # Keep the identity columns of the first row and append the column means
  cbind(eset.piece[1, -numeric.columns],
        t(colMeans(eset.piece[, numeric.columns])))
}
mean.eset <- ddply(eset.consensus.grand, .(ensg), spec.mean, .progress = "tk")
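For reference, numeric.columns is not defined in the question; presumably it is an index vector of the numeric expression columns, along these lines (the toy data frame here is made up to match the question's column names):

```r
# Hypothetical definition of numeric.columns (not shown in the question):
# indices of the numeric expression columns of the data frame.
eset <- data.frame(ensg   = c("E1", "E1", "E2"),
                   symbol = c("S1", "S1", "S2"),
                   XXA_00 = c(1, 3, 5),
                   XXA_36 = c(2, 4, 6))
numeric.columns <- which(sapply(eset, is.numeric))
numeric.columns
# XXA_00 XXA_36
#      3      4
```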
My first aggregate implementation looks like this,
mean.eset <- aggregate(eset[, numeric.columns], by = list(eset$ensg), FUN = mean, na.rm = TRUE)
and is much faster. But the problem with aggregate is that I then have to reattach the describing variables. I have not figured out how to use my custom function with aggregate, since aggregate passes only vectors, not data frames, to FUN. Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?
If speed is a primary concern, you should take a look at the data.table package. When the number of rows or grouping columns is large, data.table really seems to shine. The package's wiki has several links to other good introductory documents. Here's how you'd do this aggregation with data.table:
library(data.table)

# Turn the data.frame above into a data.table
dt <- data.table(df)

# Aggregation
dt[, list(XXA_00 = .Internal(mean(XXA_00)),
          XXA_36 = .Internal(mean(XXA_36)),
          XXB_00 = .Internal(mean(XXB_00))),
   by = c("ensg", "chromosome", "symbol")]
This gives us:
ensg chromosome symbol XXA_00 XXA_36 XXB_00
[1,] E1 A S1 0.18026869 0.13118997 0.6558433
[2,] E2 B S2 -0.48830539 0.24235537 0.5971377
[3,] E3 C S3 -0.04786984 -0.03139901 0.5618208
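If there are many expression columns, listing each one by hand gets tedious. A sketch of the same aggregation using data.table's .SD/.SDcols idiom, which takes the mean of every listed column per group (the toy data here is made up, with the question's column names):

```r
library(data.table)

# Toy data with the question's column names (values made up)
df <- data.frame(ensg       = c("E1", "E1", "E2"),
                 chromosome = c("A",  "A",  "B"),
                 symbol     = c("S1", "S1", "S2"),
                 XXA_00     = c(1, 3, 5),
                 XXA_36     = c(2, 4, 6),
                 XXB_00     = c(0, 2, 4))
dt <- data.table(df)

# .SDcols restricts .SD to the numeric columns; lapply() takes each one's mean
res <- dt[, lapply(.SD, mean),
          by = c("ensg", "chromosome", "symbol"),
          .SDcols = c("XXA_00", "XXA_36", "XXB_00")]
res
```

This scales to any number of expression columns without editing the aggregation expression itself.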
The aggregate solution provided above seems to fare pretty well when working with the 30-row data.frame, judging by the output from the rbenchmark package. However, when the data.frame contains 3e5 rows, data.table pulls away as a clear winner. Here's the output:
benchmark(fag(), fdt(), replications = 10)
test replications elapsed relative user.self sys.self
1 fag() 10 12.71 23.98113 12.40 0.31
2 fdt() 10 0.53 1.00000 0.48 0.05
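The fag() and fdt() wrapper bodies are not shown above; presumably they were something along these lines (a reconstruction under assumptions, with n kept small here so the example runs quickly; the original comparison used 3e5 rows):

```r
library(data.table)
library(rbenchmark)

# Hypothetical reconstruction of the benchmark setup (fag/fdt are named but
# not defined in the answer); sizes here are smaller than the original.
n  <- 1000
df <- data.frame(chromosome = gl(3, n, labels = c("A", "B", "C")),
                 ensg       = gl(3, n, labels = c("E1", "E2", "E3")),
                 symbol     = gl(3, n, labels = c("S1", "S2", "S3")),
                 XXA_00     = rnorm(3 * n),
                 XXA_36     = rnorm(3 * n),
                 XXB_00     = rnorm(3 * n))
dt <- data.table(df)

# aggregate() via the formula interface
fag <- function() aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
                            data = df, FUN = mean)

# data.table grouping
fdt <- function() dt[, list(XXA_00 = mean(XXA_00),
                            XXA_36 = mean(XXA_36),
                            XXB_00 = mean(XXB_00)),
                     by = c("ensg", "chromosome", "symbol")]

benchmark(fag(), fdt(), replications = 10)
```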
First let's define a toy example:
df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
                 probeset   = gl(3, 10, labels = c('X', 'Y', 'Z')),
                 ensg       = gl(3, 10, labels = c('E1', 'E2', 'E3')),
                 symbol     = gl(3, 10, labels = c('S1', 'S2', 'S3')),
                 XXA_00     = rnorm(30),
                 XXA_36     = rnorm(30),
                 XXB_00     = rnorm(30))
And then we use aggregate with the formula interface:
df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
data = df, FUN = mean)
> df1
ensg chromosome symbol XXA_00 XXA_36 XXB_00
1 E1 A S1 -0.02533499 -0.06150447 -0.01234508
2 E2 B S2 -0.25165987 0.02494902 -0.01116426
3 E3 C S3 0.09454154 -0.48468517 -0.25644569
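As for the questioner's other concern, reattaching the describing variables after a plain aggregate(..., by = list(...)) call: one option is to merge() the result against the unique identity columns, which works here because chromosome and symbol are constant within each ensg. A sketch on made-up data with the question's column names:

```r
# Aggregate the numeric columns by ensg alone, then merge the identity
# columns back in (they are constant within each ensg, so unique() yields
# exactly one identity row per ensg).
df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
                 ensg       = gl(3, 10, labels = c('E1', 'E2', 'E3')),
                 symbol     = gl(3, 10, labels = c('S1', 'S2', 'S3')),
                 XXA_00     = rnorm(30),
                 XXA_36     = rnorm(30),
                 XXB_00     = rnorm(30))
numeric.columns <- c("XXA_00", "XXA_36", "XXB_00")

means <- aggregate(df[, numeric.columns],
                   by = list(ensg = df$ensg), FUN = mean, na.rm = TRUE)
ids   <- unique(df[, c("chromosome", "ensg", "symbol")])
df2   <- merge(ids, means, by = "ensg")
```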