R - How to run average & max on different data.table columns based on multiple factors & return original colnames

Question

I am migrating my R code from data.frame + plyr to data.table because I need a faster, more memory-efficient way to handle a big data set. My R skills are limited and I have been stuck on this for a whole day, so I would appreciate any pointers from the SO experts here.

My Goals

Aggregate rows in my data.table with 2 functions, average and max, each run on selected columns (column names passed via a vector), while grouping by columns that are also passed via a vector.
The resulting DT should keep the original column names.
The DT should not be copied unnecessarily, in order to conserve memory.

My Test Code

library(data.table)

DT = data.table( a=LETTERS[c(1,1,1:4)], b=4:9, c=3:8, d=rnorm(6),
                 e=LETTERS[c(rep(25,3),rep(26,3))], key="a" )

GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c( "c", "d")

What I tried but didn't work for me

DT[, list( b=max( b ), c=mean(c), d=mean(d) ), by=c( GrpVar1, GrpVar2 ) ]
# Column names are hard-coded - not what I want

DT[, list( max( get(VarToMax) ), mean( get(VarToAve) )), by=c( GrpVar1, GrpVar2 ) ]
# Column names become 'V1' and 'V2'; worse, one column goes missing - not what I want either

DT[, list( get(VarToMax)=max( get(VarToMax) ), 
           get(VarToAve)=mean( get(VarToAve) ) ), by=c( GrpVar1, GrpVar2 ) ]
# The code above gave an error!

Additional Question

Based on my very limited understanding of data.table, the with = F argument should tell R to use the values of VarToMax and VarToAve as column names, but running the code below leads to an error.

DT[, list( max(VarToMax), mean(VarToAve) ), by=c( GrpVar1, GrpVar2 ), with=F ]

# Error in `[.data.table`(DT, , list(max(VarToMax), mean(VarToAve)), by = c(GrpVar1,  : 
#   object 'ansvals' not found
# In addition: Warning message:
# In mean.default(VarToAve) :
#   argument is not numeric or logical: returning NA
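
A note on the warning above: inside max() and mean(), VarToMax and VarToAve are still plain character vectors, so mean(VarToAve) receives c("c", "d") rather than the columns. with = FALSE does not change that; it changes how j as a whole is read, namely as a vector of column names to select. The sketch below illustrates both points. The fixed values for d are an assumption made only so the example is reproducible.

```r
library(data.table)

# Same layout as the question's DT; d uses fixed values instead of rnorm()
# (an assumption, only to make the example reproducible).
DT <- data.table(a = LETTERS[c(1, 1, 1:4)], b = 4:9, c = 3:8,
                 d = c(0.5, 1.5, 2.5, 1, 2, 3),
                 e = LETTERS[c(rep(25, 3), rep(26, 3))], key = "a")
VarToMax <- "b"
VarToAve <- c("c", "d")

# max() on a character vector returns a string, not the maximum of column b.
max_of_name <- max(VarToMax)

# with = FALSE treats j as a vector of column names to select.
selected <- DT[, c(VarToMax, VarToAve), with = FALSE]
```

So with = FALSE is useful for picking columns by name, but it is not a way to substitute column names inside function calls in j.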

Existing SO solutions can't help

Arun's solution is how I got to this point, but I am now stuck. His other solution, using lapply and .SDcols, involves creating 2 extra data.tables, which does not meet my memory-conserving requirement.

dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]

I am SO confused over data.table! Any help would be most appreciated!

shadow · Accepted Answer

This is similar to @David Arenburg's approach, but it uses .SDcols to simplify the notation. I also show the code all the way through the merge.

DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = c(GrpVar1, GrpVar2)]
DTmaxs <- DT[, lapply(.SD, max), .SDcols = VarToMax, by = c(GrpVar1, GrpVar2)]
merge(DTmaxs, DTaves)
##    a e b c          d
## 1: A Y 6 4  0.2230091
## 2: B Z 7 6  0.5909434
## 3: C Z 8 7 -0.4828223
## 4: D Z 9 8 -1.3591240
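
The merge above relies on merge() choosing the join columns by default. A slightly more defensive sketch passes the grouping columns explicitly through by, so the join does not depend on keys or on which column names the two tables happen to share. The fixed values for d are an assumption for reproducibility; everything else follows the question's setup.

```r
library(data.table)

# Question's layout, with fixed d values instead of rnorm() (assumption).
DT <- data.table(a = LETTERS[c(1, 1, 1:4)], b = 4:9, c = 3:8,
                 d = c(0.5, 1.5, 2.5, 1, 2, 3),
                 e = LETTERS[c(rep(25, 3), rep(26, 3))], key = "a")
GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c("c", "d")

# Aggregate each set of columns separately, as in the answer above.
DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = c(GrpVar1, GrpVar2)]
DTmaxs <- DT[, lapply(.SD, max), .SDcols = VarToMax, by = c(GrpVar1, GrpVar2)]

# Join on the grouping columns explicitly.
res <- merge(DTmaxs, DTaves, by = c(GrpVar1, GrpVar2))
```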

Alternatively, you can do this in one go by subsetting .SD with the .. prefix, which looks for VarToAve in the parent frame of .SD (as opposed to a column named VarToAve):

DT[, c(lapply(.SD[, ..VarToAve], mean), 
       lapply(.SD[, ..VarToMax], max)), 
   by = c(GrpVar1, GrpVar2)]
##    a e c          d b
## 1: A Y 4  0.2230091 6
## 2: B Z 6  0.5909434 7
## 3: C Z 7 -0.4828223 8
## 4: D Z 8 -1.3591240 9
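
If more than two functions are needed, the same one-pass idea can be sketched with a named list that maps each column to its aggregation function. This is a variation, not part of the original answer: agg_funs is a hypothetical helper name, and the fixed values for d are an assumption for reproducibility. Restricting .SD through .SDcols keeps only the needed columns in each group, and Map() carries the column names over from the list.

```r
library(data.table)

# Question's layout, with fixed d values instead of rnorm() (assumption).
DT <- data.table(a = LETTERS[c(1, 1, 1:4)], b = 4:9, c = 3:8,
                 d = c(0.5, 1.5, 2.5, 1, 2, 3),
                 e = LETTERS[c(rep(25, 3), rep(26, 3))], key = "a")
GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c("c", "d")

# Hypothetical helper: one function per column, named by the column.
agg_funs <- c(setNames(rep(list(max), length(VarToMax)), VarToMax),
              setNames(rep(list(mean), length(VarToAve)), VarToAve))

# .SD holds the columns in the order of names(agg_funs), so Map() pairs
# each function with its column and keeps the original column names.
res <- DT[, Map(function(f, x) f(x), agg_funs, .SD),
          by = c(GrpVar1, GrpVar2), .SDcols = names(agg_funs)]
```

The result has the grouping columns followed by b, c and d, without creating intermediate tables to merge.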


Tags: r, aggregate, data.table

NoviceProg
