How to optimize lagged differences in data.table (r)

Question

I am trying to optimize a snippet of r code to calculate lagged differences using data.table in R. I have two working solutions, but both run painfully slow on my real data (500 million row datasets). I have enjoyed the speedup and efficiency of using data.table generally, but both solutions I implemented are quite slow (compared to other data.table operations).

Could anyone offer suggestion for more efficient coding practice in data.table for this specific task?

library(data.table)
set.seed(1)
id <- 1:10
date_samp <- seq.Date(as.Date("2010-01-01"),as.Date("2011-01-01"),"days")
dt1 <- 
  data.table(id = sample(id,size = 30,replace=T),
             date_1 = sample(date_samp,size = 30,replace=T))
setkey(dt1,id,date_1)
### Attempt to get lagged date
## Attempt 1
dt1[,date_diff:=c(0,diff(date_1)),
    by=id]
## Attempt 2
## Works but gives warnings
dt1[,date_diff:=NULL]
dt1[,n_group := .N,by=id]
dt1[,date_diff:=c(0,date_1[2:n_group]-date_1[1:(n_group-1)]),
    by=id]

co_biostat · Accepted Answer

After a bit more effort I found the "shift()" function on a related question. I have made the data a bit larger and done some crude profiling, as well as added a few more approaches...but please update and provide a different answer if there is a more efficient approach.

In response to comments below I added and changed some things...attempt for is numeric (not integer), and my keyed by was incorrect. I added an integer comparison and a keyed by with the integer (in addition to the numeric). It now looks like converting the date to an integer then using "grouping by each i" is the fastest solution.

library(data.table)
set.seed(1)
id <- 1:100
date_samp <- seq.Date(as.Date("2010-01-01"),as.Date("2011-01-01"),"days")
n_samp <- 1e7
dt1 <- 
  data.table(id = sample(id,size = n_samp,replace=T),
             date_1 = sample(date_samp,size = n_samp,replace=T))
setkey(dt1,id,date_1)
### Attempt to get lagged date
## Attempt 1
dt1[,date_diff:=NULL]
system.time(dt1[,date_diff:=c(0,diff(date_1)),
    by=id])
## Attempt 2
dt1[,date_diff:=NULL]
dt1[,n_group := .N,by=id]
system.time(dt1[,date_diff:=c(0,date_1[2:n_group]-date_1[1:(n_group-1)]),
    by=id])
## Attempt 3
dt1[,date_diff:=NULL]
system.time(dt1[,date_diff:=date_1-shift(date_1),
    by=id])
## Attempt 4
## Use numeric instead
dt1[,date_diff:=NULL]
dt1[,date_1num:=NULL]
dt1[,date_1num:=as.numeric(date_1)]
system.time(dt1[,date_diff:=date_1num-shift(date_1num),
                by=id])
## Attempt 5
## Use a keyed by
dt_key <- unique(dt1[,list(id)])
dt1[,date_diff:=NULL]
system.time(dt1[dt_key,
    date_diff:=date_1num-shift(date_1num),
    by=.EACHI])

## Attempt 6
## Use integers instead
dt1[,date_diff:=NULL]
dt1[,date_1int:=as.integer(date_1)]
system.time(dt1[,date_diff:=date_1int-shift(date_1int),
                by=id])
## Attempt 7
## Use integers with keyed by
dt1[,date_diff:=NULL]
dt1[,date_1int:=as.integer(date_1)]
system.time(dt1[dt_key,
                date_diff:=date_1int-shift(date_1int),
                by=.EACHI])


# attempt   user  system elapsed 
# 1         0.34    0.25    0.59     
# 2         0.37    0.28    0.67 
# 3         0.25    0.16    0.41
# 4         0.11    0.01    0.13 
# 5         0.06    0.03    0.10
# 6         0.09    0.00    0.09
# 7         0.05    0.00    0.04

Clayton Stanley · Answer

If you want to nix the by:

  dt1[order(id, date_1)
      ][, idP := shift(id, type='lag')
      ][, headP := is.na(idP) | idP != id
      ][, date_1P := shift(date_1, type='lag')
      ][headP == T, date_diff := 0
      ][headP == F, date_diff := date_1 - date_1P
      ][, c('headP', 'idP', 'date_1P') := NULL
      ][]

How to optimize lagged differences in data.table (r)

Tags:

r

data.table

co_biostat

2 Answers

co_biostat

Clayton Stanley

Recent Activity

Donate For Us

How to optimize lagged differences in data.table (r)

Tags:

r

data.table

co_biostat

2 Answers

co_biostat

Clayton Stanley

Related questions

Recent Activity

Donate For Us