aggregate a matrix (or data.frame) by column name groups in R

Question

I have a large matrix of about 3000 rows by 3000 columns. For every row, I'd like to aggregate (take the mean of) the values grouped by column name. The column names repeat and appear in random order, for example:

 Tree Tree House House Tree Car Car House

The result (the per-row mean for each group) should have the following columns:

  Tree House Car

The tricky part (at least for me) is that I do not know all the column names, and they are all in random order.

akrun · Accepted Answer

You could try

res1 <- vapply(unique(colnames(m1)), function(x) 
      rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                             numeric(nrow(m1)) )

Or

res2 <-  sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )

identical(res1,res2)
#[1] TRUE
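
To make the two calls above concrete, here is a small worked sketch using the eight column names from the question. The matrix `m` and its values are made up for illustration; only the column names come from the question.

```r
# Toy matrix: 2 rows, 8 columns named as in the question (values are made up)
m <- matrix(1:16, nrow = 2,
            dimnames = list(NULL, c("Tree", "Tree", "House", "House",
                                    "Tree", "Car", "Car", "House")))

# For each distinct name, average the columns carrying that name, row by row.
# drop = FALSE keeps a single-column group as a matrix so rowMeans() still works.
res <- vapply(unique(colnames(m)), function(x)
  rowMeans(m[, colnames(m) == x, drop = FALSE], na.rm = TRUE),
  numeric(nrow(m)))

colnames(res)
# [1] "Tree"  "House" "Car"
res[, "House"]
# [1]  9 10
```

The result has one column per distinct name, in order of first appearance, so the names do not need to be known in advance.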

Another option might be to reshape into long form and then do the aggregation

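 library(data.table)
 # melt() on a matrix is provided by reshape2, so it is called explicitly here
 res3 <- dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2, fun.aggregate=mean)[,Var1:= NULL]
 identical(res1, as.matrix(res3))
 #[1] TRUE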
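
The same long-form idea can be sketched in base R without extra packages. This is not the data.table call above, only an illustration of the technique: each cell is paired with its row index and column name, and `tapply()` then takes the mean per (row, name) pair. The helper names `values`, `row_id`, `group`, `long_res` and `wide_res` are made up for this sketch, and the columns of the `tapply()` result come out in sorted order rather than order of first appearance.

```r
set.seed(24)
m1 <- matrix(sample(0:40, 10 * 40, replace = TRUE), ncol = 10,
             dimnames = list(NULL, sample(c("Tree", "House", "Car"), 10,
                                          replace = TRUE)))

# Long form: one entry per cell, with its row index and column name alongside
values <- as.vector(m1)
row_id <- as.vector(row(m1))
group  <- colnames(m1)[as.vector(col(m1))]

# Mean per (row, column name) pair; the result is a rows x groups matrix
long_res <- tapply(values, list(row_id, group), mean, na.rm = TRUE)

# Wide-form result from the first approach, for comparison
wide_res <- vapply(unique(colnames(m1)), function(x)
  rowMeans(m1[, colnames(m1) == x, drop = FALSE], na.rm = TRUE),
  numeric(nrow(m1)))

# Compare after putting the columns in the same order and dropping dimnames
all.equal(unname(long_res[, colnames(wide_res), drop = FALSE]),
          unname(wide_res))
```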

Benchmarks

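For a 3000*3000 matrix, the first two methods perform almost identically (sapply is about 3% slower than vapply), and both are roughly 4.5 to 5 times faster than the reshape approach.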

set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE), 
   ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
    3000,replace=TRUE)))

library(microbenchmark)

f1 <-function() {vapply(unique(colnames(m1)), function(x) 
     rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                           numeric(nrow(m1)) )}
f2 <- function() {sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )}

f3 <- function() {dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[,
            Var1:= NULL]}

microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
#   Unit: relative
# expr      min       lq     mean   median       uq      max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004    10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531    10
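
A further base-R variant, which is not part of the answer above and was not included in the benchmark, is to sum the columns of each group with `rowsum()` and divide by the group sizes. The sketch below assumes the matrix contains no `NA` values (the approaches above skip them via `na.rm=TRUE`, which a plain sum divided by a count would not reproduce), and its columns come out in sorted order. The names `sums`, `counts` and `res4` are made up for illustration.

```r
set.seed(24)
m1 <- matrix(sample(0:40, 10 * 40, replace = TRUE), ncol = 10,
             dimnames = list(NULL, sample(c("Tree", "House", "Car"), 10,
                                          replace = TRUE)))

# rowsum() groups rows, so transpose first: the result is groups x rows
sums <- rowsum(t(m1), group = colnames(m1))

# Number of columns per group, aligned to the row order of 'sums'
counts <- as.vector(table(colnames(m1))[rownames(sums)])

# Divide each group's sums by its size, then transpose back to rows x groups
res4 <- t(sums / counts)

# Reference result from the vapply approach, for comparison
ref <- vapply(unique(colnames(m1)), function(x)
  rowMeans(m1[, colnames(m1) == x, drop = FALSE], na.rm = TRUE),
  numeric(nrow(m1)))

all.equal(unname(res4[, colnames(ref), drop = FALSE]), unname(ref))
```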

Data

 set.seed(24)
 m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10, 
     dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))
