Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can a SQL non-equi join task (example below) be done (faster and /or neater) using data.table?

Tags:

r

data.table

I have a data table with 3 columns: pid ( a job) , starttime (time when it starts ) and fintime (time when it ends) as made up below:

require(data.table)

dt <- data.table(pid=sample(1:100,100), starttime = sample(1:100,100)/100)[,fintime:=starttime + round(runif(100)/4,2)] 

I need to identify all possible TWO jobs that can be done sequentially but confirming to an acceptable “gap” between the Jobs. I can do this using SQL for a gap between 0.05 and 0.4 units(of time) as below:

require(sqldf)
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as startime,  b.fintime as fintime 
    from  dt a, dt b 
    where  a.fintime < b.starttime - 0.05
       and a.fintime > b.starttime - 0.4
  ")

How can I do this using data.table ? (I am hoping for a performance improvement over sqldf when the data is large and with more constraints)

like image 780
6th Avatar asked Jan 25 '26 04:01

6th


1 Answers

So here is a data.table approach that is about 20X faster, but there are some caveats (described at the end).

require(data.table)
set.seed(1)  # for reproducible example
n <- 100     # simple example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n,2)[,fintime:=starttime + round(runif(n)/4,2)]
# sqldf approach
require(sqldf)
f.sql <- function(dt) {
  sqldf("create index idx on dt(starttime,fintime)")
  res <- sqldf("select a.pid as first, b.pid as second , a.starttime as starttime,  b.fintime as fintime 
    from  dt a, dt b 
    where b.starttime >= a.fintime + 0.05
      and b.starttime <= a.fintime + 0.4
  ")  
}
res.sql <- f.sql(dt)

# data.table approach with foverlaps(...): need >= 1.9.4 for this!!
packageVersion("data.table")
# [1] ‘1.9.4’
f.DT <- function(dt) {
  lookup <- dt[,list(second=pid, fintime, a=starttime,b=starttime)]
  setkey(lookup,a,b)
  DT      <- dt[,list(first=pid, starttime, a=fintime+0.05,b=fintime+0.4)]
  J.olaps <- foverlaps(DT,lookup,type="any",nomatch=0) 
  J.olaps[,list(first,second,starttime,fintime)]
}
res.DT <- f.DT(dt)

So this uses the foverlaps(...) function in the newest version of data.table (1.9.4). Suppose you have two data.tables, x and y. Each has a pair of columns that form a range. foverlaps(...) finds all combinations of records in x and y where there is overlap between the range in x and the range in y. Here we set this up so that x has range defined by fintime+0.04 and fintime+0.5, and y has range defined by starttime at both ends. So now foverlaps(...) looks for any combination of records where the starttime is between 0.04 and 0.5 more than the fintime.

Now for the caveats:

First, this only works (AFAIK) if you are willing to relax your constraints to a closed interval (e.g., b.starttime >= a.fintime + 0.05, vs. strictly >).

Second, the data.table approach finds all the records found in the sql approach plus some additional records. You can see this with the following code:

indx  <- data.table(first=res.sql$first,second=res.sql$second,key=c("first","second"))
setkey(res.DT,first,second)
extra <-  res.DT[!indx,]

The extra records seem like they are legitimate, so the question is: why are they not found by sqldf(...)? I can't answer that.

Third, this works for your example, but might not be easy to extend with "more constraints".

Finally, here is a "benchmark" with a dataset more similar to your actual data:

set.seed(1)
n <- 1e4   # more realistic example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n)[,fintime:=starttime + round(runif(n)/4,2)]
system.time(res.sql <- f.sql(dt))
#    user  system elapsed 
#   45.25    0.53   45.80 
system.time(res.DT  <- f.DT(dt))
#    user  system elapsed 
#    2.09    0.86    2.94 
like image 187
jlhoward Avatar answered Jan 26 '26 16:01

jlhoward