
R - How to run average & max on different data.table columns based on multiple factors & return original colnames

I am moving my R code from data.frame + plyr to data.table because I need a faster and more memory-efficient way to handle a big data set. Unfortunately, my R skills are woefully limited and I have been stuck on this for a whole day. I would appreciate it if the SO experts here could enlighten me.

My Goals

  • Aggregate rows in my data.table with two functions, mean and max, each applied to selected columns (column names passed in a vector), while grouping by columns whose names are also passed in a vector.
  • The resulting DT should contain the original column names.
  • There should be no unnecessary copying of the DT, in order to conserve memory.

My Test Code

library(data.table)

DT = data.table( a=LETTERS[c(1,1,1:4)],b=4:9, c=3:8, d = rnorm(6), 
                 e=LETTERS[c(rep(25,3),rep(26,3))], key="a" )

GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c( "c", "d")

What I tried but didn't work for me

DT[, list( b=max( b ), c=mean(c), d=mean(d) ), by=c( GrpVar1, GrpVar2 ) ]  
# Hard-code col name - not what I want

DT[, list( max( get(VarToMax) ), mean( get(VarToAve) )), by=c( GrpVar1, GrpVar2 ) ]  
# Column names become 'V1', 'V2'; worse, one column goes missing. Not what I want either

DT[, list( get(VarToMax)=max( get(VarToMax) ), 
           get(VarToAve)=mean( get(VarToAve) ) ), by=c( GrpVar1, GrpVar2 ) ]
# Above code gave Error!

Additional Question

Based on my very limited understanding of DTs, the with = F argument should instruct R to parse the values of VarToMax and VarToAve, but running the code below leads to an error.

DT[, list( max(VarToMax), mean(VarToAve) ), by=c( GrpVar1, GrpVar2 ), with=F ]

# Error in `[.data.table`(DT, , list(max(VarToMax), mean(VarToAve)), by = c(GrpVar1,  : 
#   object 'ansvals' not found
# In addition: Warning message:
# In mean.default(VarToAve) :
#   argument is not numeric or logical: returning NA

Existing SO solutions can't help

Arun's solution was how I got to this point, but I am very stuck. His other solution, using lapply and .SDcols, involves creating two extra DTs, which does not meet my memory-conserving requirement.

dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]

I am SO confused over data.table! Any help would be most appreciated!

asked Feb 02 '15 13:02 by NoviceProg


1 Answer

This is similar to @David Arenburg's approach, but uses .SDcols to simplify the notation. I also show the code all the way through to the merge.

DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = c(GrpVar1, GrpVar2)]
DTmaxs <- DT[, lapply(.SD, max), .SDcols = VarToMax, by = c(GrpVar1, GrpVar2)]
merge(DTmaxs, DTaves)
##    a e b c          d
## 1: A Y 6 4  0.2230091
## 2: B Z 7 6  0.5909434
## 3: C Z 8 7 -0.4828223
## 4: D Z 9 8 -1.3591240
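For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same two-step approach. The `set.seed(1)` call and the combined `GrpVars` vector are my own additions for illustration (they are not in the question), so the values of `d` will differ from the output shown above. Note that each intermediate table holds one row per group rather than one per input row, which should keep the extra memory modest when there are far fewer groups than rows.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so that d is reproducible; not in the original
DT <- data.table(a = LETTERS[c(1, 1, 1:4)], b = 4:9, c = 3:8, d = rnorm(6),
                 e = LETTERS[c(rep(25, 3), rep(26, 3))], key = "a")

GrpVars  <- c("a", "e")   # hypothetical name: GrpVar1 and GrpVar2 combined
VarToMax <- "b"
VarToAve <- c("c", "d")

# Each intermediate table has one row per (a, e) group
DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = GrpVars]
DTmaxs <- DT[, lapply(.SD, max),  .SDcols = VarToMax, by = GrpVars]

# Name the join columns explicitly rather than relying on merge()'s defaults
res <- merge(DTmaxs, DTaves, by = GrpVars)
print(res)
```

Passing `by` to `merge()` explicitly is a design choice on my part: it makes the join independent of whatever keys the intermediate tables happen to carry.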

Alternatively, you can do this in one go by subsetting .SD with the .. notation, which looks for VarToAve in the parent frame of .SD (as opposed to a column named VarToAve):

DT[, c(lapply(.SD[, ..VarToAve], mean), 
       lapply(.SD[, ..VarToMax], max)), 
   by = c(GrpVar1, GrpVar2)]
##    a e c          d b
## 1: A Y 4  0.2230091 6
## 2: B Z 6  0.5909434 7
## 3: C Z 7 -0.4828223 8
## 4: D Z 8 -1.3591240 9
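As a hedged variation on the single-pass idea, the sketch below selects the columns inside .SD with `with = FALSE` instead of the `..` prefix, which may help on data.table versions that predate that prefix. This also shows where `with = FALSE` fits: it applies when `j` is a vector of column names to select, not when `j` is an expression such as `list(max(...), mean(...))`. The `set.seed(1)` call and the names `GrpVars`, `onepass`, `twopass` and `same` are assumptions of mine for illustration. The last lines check that the single-pass result agrees with the merge-based one.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so that d is reproducible; not in the original
DT <- data.table(a = LETTERS[c(1, 1, 1:4)], b = 4:9, c = 3:8, d = rnorm(6),
                 e = LETTERS[c(rep(25, 3), rep(26, 3))], key = "a")

GrpVars  <- c("a", "e")   # hypothetical name: GrpVar1 and GrpVar2 combined
VarToMax <- "b"
VarToAve <- c("c", "d")

# Single pass: c() joins the two named lists, so the original column names survive
onepass <- DT[, c(lapply(.SD[, VarToMax, with = FALSE], max),
                  lapply(.SD[, VarToAve, with = FALSE], mean)),
              by = GrpVars]

# Merge-based result, for comparison only
twopass <- merge(DT[, lapply(.SD, max),  .SDcols = VarToMax, by = GrpVars],
                 DT[, lapply(.SD, mean), .SDcols = VarToAve, by = GrpVars],
                 by = GrpVars)

# Compare as plain data.frames with a fixed column order, ignoring keys
cols <- c(GrpVars, VarToMax, VarToAve)
same <- isTRUE(all.equal(as.data.frame(onepass)[, cols],
                         as.data.frame(twopass)[, cols]))
print(same)
```

One caveat: subsetting .SD inside `j` runs once per group, so with a very large number of groups this form may be noticeably slower than the two .SDcols calls followed by a merge. It is worth timing both on your real data.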
answered Sep 23 '22 16:09 by shadow