Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize lagged differences in data.table (r)

Tags:

r

data.table

I am trying to optimize a snippet of r code to calculate lagged differences using data.table in R. I have two working solutions, but both run painfully slow on my real data (500 million row datasets). I have enjoyed the speedup and efficiency of using data.table generally, but both solutions I implemented are quite slow (compared to other data.table operations).

Could anyone offer suggestion for more efficient coding practice in data.table for this specific task?

library(data.table)
set.seed(1)
id <- 1:10
date_samp <- seq.Date(as.Date("2010-01-01"),as.Date("2011-01-01"),"days")
dt1 <- 
  data.table(id = sample(id,size = 30,replace=T),
             date_1 = sample(date_samp,size = 30,replace=T))
setkey(dt1,id,date_1)
### Attempt to get lagged date
## Attempt 1
dt1[,date_diff:=c(0,diff(date_1)),
    by=id]
## Attempt 2
## Works but gives warnings
dt1[,date_diff:=NULL]
dt1[,n_group := .N,by=id]
dt1[,date_diff:=c(0,date_1[2:n_group]-date_1[1:(n_group-1)]),
    by=id]
like image 316
co_biostat Avatar asked Sep 02 '16 02:09

co_biostat


2 Answers

After a bit more effort I found the "shift()" function on a related question. I have made the data a bit larger and done some crude profiling, as well as added a few more approaches...but please update and provide a different answer if there is a more efficient approach.

In response to comments below I added and changed some things...attempt for is numeric (not integer), and my keyed by was incorrect. I added an integer comparison and a keyed by with the integer (in addition to the numeric). It now looks like converting the date to an integer then using "grouping by each i" is the fastest solution.

library(data.table)
set.seed(1)
id <- 1:100
date_samp <- seq.Date(as.Date("2010-01-01"),as.Date("2011-01-01"),"days")
n_samp <- 1e7
dt1 <- 
  data.table(id = sample(id,size = n_samp,replace=T),
             date_1 = sample(date_samp,size = n_samp,replace=T))
setkey(dt1,id,date_1)
### Attempt to get lagged date
## Attempt 1
dt1[,date_diff:=NULL]
system.time(dt1[,date_diff:=c(0,diff(date_1)),
    by=id])
## Attempt 2
dt1[,date_diff:=NULL]
dt1[,n_group := .N,by=id]
system.time(dt1[,date_diff:=c(0,date_1[2:n_group]-date_1[1:(n_group-1)]),
    by=id])
## Attempt 3
dt1[,date_diff:=NULL]
system.time(dt1[,date_diff:=date_1-shift(date_1),
    by=id])
## Attempt 4
## Use numeric instead
dt1[,date_diff:=NULL]
dt1[,date_1num:=NULL]
dt1[,date_1num:=as.numeric(date_1)]
system.time(dt1[,date_diff:=date_1num-shift(date_1num),
                by=id])
## Attempt 5
## Use a keyed by
dt_key <- unique(dt1[,list(id)])
dt1[,date_diff:=NULL]
system.time(dt1[dt_key,
    date_diff:=date_1num-shift(date_1num),
    by=.EACHI])

## Attempt 6
## Use integers instead
dt1[,date_diff:=NULL]
dt1[,date_1int:=as.integer(date_1)]
system.time(dt1[,date_diff:=date_1int-shift(date_1int),
                by=id])
## Attempt 7
## Use integers with keyed by
dt1[,date_diff:=NULL]
dt1[,date_1int:=as.integer(date_1)]
system.time(dt1[dt_key,
                date_diff:=date_1int-shift(date_1int),
                by=.EACHI])


# attempt   user  system elapsed 
# 1         0.34    0.25    0.59     
# 2         0.37    0.28    0.67 
# 3         0.25    0.16    0.41
# 4         0.11    0.01    0.13 
# 5         0.06    0.03    0.10
# 6         0.09    0.00    0.09
# 7         0.05    0.00    0.04
like image 194
co_biostat Avatar answered Oct 09 '22 09:10

co_biostat


If you want to nix the by:

  dt1[order(id, date_1)
      ][, idP := shift(id, type='lag')
      ][, headP := is.na(idP) | idP != id
      ][, date_1P := shift(date_1, type='lag')
      ][headP == T, date_diff := 0
      ][headP == F, date_diff := date_1 - date_1P
      ][, c('headP', 'idP', 'date_1P') := NULL
      ][]
like image 43
Clayton Stanley Avatar answered Oct 09 '22 08:10

Clayton Stanley