I am trying to optimize a snippet of r code to calculate lagged differences using data.table in R. I have two working solutions, but both run painfully slow on my real data (500 million row datasets). I have enjoyed the speedup and efficiency of using data.table generally, but both solutions I implemented are quite slow (compared to other data.table operations).
Could anyone offer suggestion for more efficient coding practice in data.table for this specific task?
library(data.table)
set.seed(1)
id <- 1:10
date_samp <- seq.Date(as.Date("2010-01-01"),as.Date("2011-01-01"),"days")
dt1 <-
data.table(id = sample(id,size = 30,replace=T),
date_1 = sample(date_samp,size = 30,replace=T))
setkey(dt1,id,date_1)
### Attempt to get lagged date
## Attempt 1
dt1[,date_diff:=c(0,diff(date_1)),
by=id]
## Attempt 2
## Works but gives warnings
dt1[,date_diff:=NULL]
dt1[,n_group := .N,by=id]
dt1[,date_diff:=c(0,date_1[2:n_group]-date_1[1:(n_group-1)]),
by=id]
After a bit more effort I found the "shift()" function on a related question. I have made the data a bit larger and done some crude profiling, as well as added a few more approaches...but please update and provide a different answer if there is a more efficient approach.
In response to comments below I added and changed some things...attempt for is numeric (not integer), and my keyed by was incorrect. I added an integer comparison and a keyed by with the integer (in addition to the numeric). It now looks like converting the date to an integer then using "grouping by each i" is the fastest solution.
library(data.table)
set.seed(1)
id <- 1:100
date_samp <- seq.Date(as.Date("2010-01-01"),as.Date("2011-01-01"),"days")
n_samp <- 1e7
dt1 <-
data.table(id = sample(id,size = n_samp,replace=T),
date_1 = sample(date_samp,size = n_samp,replace=T))
setkey(dt1,id,date_1)
### Attempt to get lagged date
## Attempt 1
dt1[,date_diff:=NULL]
system.time(dt1[,date_diff:=c(0,diff(date_1)),
by=id])
## Attempt 2
dt1[,date_diff:=NULL]
dt1[,n_group := .N,by=id]
system.time(dt1[,date_diff:=c(0,date_1[2:n_group]-date_1[1:(n_group-1)]),
by=id])
## Attempt 3
dt1[,date_diff:=NULL]
system.time(dt1[,date_diff:=date_1-shift(date_1),
by=id])
## Attempt 4
## Use numeric instead
dt1[,date_diff:=NULL]
dt1[,date_1num:=NULL]
dt1[,date_1num:=as.numeric(date_1)]
system.time(dt1[,date_diff:=date_1num-shift(date_1num),
by=id])
## Attempt 5
## Use a keyed by
dt_key <- unique(dt1[,list(id)])
dt1[,date_diff:=NULL]
system.time(dt1[dt_key,
date_diff:=date_1num-shift(date_1num),
by=.EACHI])
## Attempt 6
## Use integers instead
dt1[,date_diff:=NULL]
dt1[,date_1int:=as.integer(date_1)]
system.time(dt1[,date_diff:=date_1int-shift(date_1int),
by=id])
## Attempt 7
## Use integers with keyed by
dt1[,date_diff:=NULL]
dt1[,date_1int:=as.integer(date_1)]
system.time(dt1[dt_key,
date_diff:=date_1int-shift(date_1int),
by=.EACHI])
# attempt user system elapsed
# 1 0.34 0.25 0.59
# 2 0.37 0.28 0.67
# 3 0.25 0.16 0.41
# 4 0.11 0.01 0.13
# 5 0.06 0.03 0.10
# 6 0.09 0.00 0.09
# 7 0.05 0.00 0.04
If you want to nix the by:
dt1[order(id, date_1)
][, idP := shift(id, type='lag')
][, headP := is.na(idP) | idP != id
][, date_1P := shift(date_1, type='lag')
][headP == T, date_diff := 0
][headP == F, date_diff := date_1 - date_1P
][, c('headP', 'idP', 'date_1P') := NULL
][]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With