Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recursive assignment in data.table

Tags:

r

data.table

Is it possible in data.table to perform recursive assignment of multiple columns? By recursive I mean that the next assignment depends on the previous assignment:

library(data.table)
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]

# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val),  : 
#   cannot coerce type 'builtin' to vector of type 'double'

Of course, one can do the assignments individually, but I guess the overhead cost (e.g. grouping) wouldn't be shared among the operations:

DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
DT
#    id val cumsum cumsumofcumsum
# 1:  A   1      1              1
# 2:  A   2      3              4
# 3:  B   3      3              3
# 4:  B   4      7             10
# 5:  C   5      5              5
# 6:  C   6     11             16
# 7:  D   7      7              7
# 8:  D   8     15             22
like image 333
Daniel Krizian Avatar asked Oct 04 '14 11:10

Daniel Krizian


1 Answers

You can use a temporary variable and use it again for others variables:

DT[, c("cumsum", "cumsumofcumsum"):={
              x <- cumsum(val)
              list(x, cumsum(x))
              }, by=id]

Of course you can use dplyr and use your data.table as a backend, but I am not sure that you will get the same performance as the pure data.table method:

library(dplyr)
DT %>%
  group_by(id ) %>%
  mutate(
       cum1 = cumsum(val),
       cum2 = cumsum(cum1)
)

EDIT add some benchamrks:

Pure data.table solution is 5 times faster than dplyr one. I guess the sort in dplyr behind the scene can explain this difference.

f_dt <- 
  function(){
DT[, c("cumsum", "cumsumofcumsum"):={
  x <- as.numeric(cumsum(val))
  list(x, cumsum(x))
}, by=id]
}

f_dplyr <- 
  function(){
DT %>%
  group_by(id ) %>%
  mutate(
       cum1 = as.numeric(cumsum(val)),
       cum2 = cumsum(cum1)
)
}


library(microbenchmark)

microbenchmark(f_dt(),f_dplyr(),times = 100)
    expr       min       lq    median        uq       max neval
    f_dt()  2.580121  2.97114  3.256156  4.318658  13.49149   100
 f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626   100
like image 105
agstudy Avatar answered Oct 29 '22 17:10

agstudy