Apply different functions to different sets of columns by group

I have a data.table with the following features:

  • bycols: columns that divide the data into groups
  • nonvaryingcols: columns that are constant within each group (so that taking the first item from within each group and carrying that through would be sufficient)
  • datacols: columns to be aggregated / summarized (e.g. sum them within group)

I'm curious what the most efficient way is to do what you might call a mixed collapse, taking all three of the above column sets as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.

Example data, where the different sets of columns are stored in character vectors:

require(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
  g1 = rep( letters, 10 ),
  g2 = rep( c(LETTERS,LETTERS), each = 5 ),
  dat1 = runif( 260 ),
  dat2 = runif( 260 ),
  nv1 = rep( seq(130), 2),
  nv2 = rep( seq(130), 2) 
)

Final data should look like:

   g1 g2      dat1      dat2 nv1 nv2
1:  a  A 0.8403809 0.6713090   1   1
2:  b  A 0.4491883 0.4607716   2   2
3:  c  A 0.6083939 1.2031960   3   3
4:  d  A 1.5510033 1.2945761   4   4
5:  e  A 1.1302971 0.8573135   5   5
6:  f  B 1.4964821 0.5133297   6   6

I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.

Asked Mar 08 '14 by Ari B. Friedman


3 Answers

As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.

Here, for instance, given the particular values in your example, you'd like a call that looks like:

test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
       by=c("g1", "g2")]

so the expression you'd like evaluated in the j-slot is

list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
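
For illustration, here is what those building blocks look like at the console; call() and as.symbol() construct unevaluated pieces of code:

as.symbol("dat1")               # the symbol dat1
call("sum", as.symbol("dat1"))  # the unevaluated call sum(dat1)
call("[", as.symbol("nv1"), 1)  # "[" is a function too, so this is nv1[1]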

Most of the following function is taken up with constructing just that expression:

f <- function(dt, bycols, datacols, nvcols) {
    # One expression per column: sum(<datacol>) for the data columns,
    # <nvcol>[1] for the nonvarying columns
    e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
           sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
    # Wrap them all in a single call to list(), then evaluate it in j
    e <- as.call(c(as.symbol("list"), e))
    dt[, eval(e), by = bycols]
}

f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
##      g1 g2      dat1      dat2 nv1 nv2
##   1:  a  A 0.8403809 0.6713090   1   1
##   2:  b  A 0.4491883 0.4607716   2   2
##   3:  c  A 0.6083939 1.2031960   3   3
##   4:  d  A 1.5510033 1.2945761   4   4
##   5:  e  A 1.1302971 0.8573135   5   5
##  ---                                  
## 126:  v  Z 0.5627018 0.4282380 126 126
## 127:  w  Z 0.7588966 1.4429034 127 127
## 128:  x  Z 0.7060596 1.3736510 128 128
## 129:  y  Z 0.6015249 0.4488285 129 129
## 130:  z  Z 1.5304034 1.6012207 130 130
Answered Oct 20 '22 by Josh O'Brien


Here's what I had come up with. It works, but very slowly.

test[, {
  cbind(
    # sum each data column within the group; t() turns the resulting
    # named vector back into a one-row table
    as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
    # take the first row of the nonvarying columns
    .SD[, ..nonvaryingcols][1]
  )
}, by = bycols ]

Benchmarks

FunJosh <- function() {
  f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
}
FunAri <- function() {
  test[, {
    cbind(
      as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
      .SD[, ..nonvaryingcols][1]
    )
  }, by = bycols ]
}
FunEddi <- function() {
  cbind(
    test[, lapply(.SD, sum), by = bycols, .SDcols = datacols],
    # the trailing subset drops the repeated by-columns before cbind()
    test[, lapply(.SD, "[", 1), by = bycols, .SDcols = nonvaryingcols][, ..nonvaryingcols]
  )
}

library(microbenchmark)
identical(FunJosh(), FunAri())
# [1] TRUE

microbenchmark(FunJosh(), FunAri(), FunEddi())
#Unit: milliseconds
#      expr        min         lq     median         uq        max neval
# FunJosh()   2.749164   2.958478   3.098998   3.470937   6.863933   100
#  FunAri() 246.082760 255.273839 284.485654 360.471469 509.740240   100
# FunEddi()   5.877494   6.229739   6.528205   7.375939 112.895573   100

At least two orders of magnitude slower than @joshobrien's solution. Edit: @eddi's solution is much faster as well, and shows that cbind wasn't optimal but can be fairly fast in the right hands. The slowdown was probably all the transposing and sapplying I was doing, rather than just using lapply directly.
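
For reference, the same mixed collapse can also be written without building an expression at all, by concatenating two lapply() results inside j. The sketch below uses a standard data.table idiom (FunSD is just an illustrative name, not one of the benchmarked functions above); subsetting .SD within every group likely makes it slower than the eval(e) approach:

FunSD <- function() {
  test[, c(lapply(.SD[, datacols, with = FALSE], sum),
           lapply(.SD[, nonvaryingcols, with = FALSE], "[", 1)),
       by = bycols]
}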

Answered Oct 20 '22 by Ari B. Friedman


Just for a bit of variety, here is a variant of @Josh O'Brien's solution that uses bquote instead of call. I did try to replace the final as.call with a bquote, but because bquote doesn't support list splicing (e.g., see this question), I couldn't get that to work.

f <- function(dt, bycols, datacols, nvcols) {
        datacols = sapply(datacols, as.symbol)
        nvcols = sapply(nvcols, as.symbol)
        # bquote() substitutes the value of each .(x) into the quoted expression
        e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
              lapply(nvcols, function(x) bquote(.(x)[1])))
        e = as.call(c(as.symbol("list"), e))
        dt[, eval(e), by = bycols]
}


>   f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
     g1 g2   dat1   dat2 nv1 nv2
  1:  a  A 0.8404 0.6713   1   1
  2:  b  A 0.4492 0.4608   2   2
  3:  c  A 0.6084 1.2032   3   3
  4:  d  A 1.5510 1.2946   4   4
  5:  e  A 1.1303 0.8573   5   5
 ---                            
126:  v  Z 0.5627 0.4282 126 126
127:  w  Z 0.7589 1.4429 127 127
128:  x  Z 0.7061 1.3737 128 128
129:  y  Z 0.6015 0.4488 129 129
130:  z  Z 1.5304 1.6012 130 130
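
As a side note: in more recent R (4.0.0 and later, well after this answer was written), bquote() does support splicing via its splice argument and the ..() marker, so the final as.call() can be dropped. A sketch under that assumption (f2 is just an illustrative name, untested here):

f2 <- function(dt, bycols, datacols, nvcols) {
    e <- c(sapply(datacols, function(x) bquote(sum(.(as.symbol(x))))),
           sapply(nvcols, function(x) bquote(.(as.symbol(x))[1])))
    # ..(e) splices the list of expressions into the list() call
    e <- bquote(list(..(e)), splice = TRUE)
    dt[, eval(e), by = bycols]
}
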
Answered Oct 21 '22 by Clayton Stanley