Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding gaps between intervals using data.table

Tags:

r

data.table

I have the following problem: given a set of non-overlapping intervals in a data.table, report the gaps between the intervals.

I have implemented this once in SQL, however I am struggling with data.table due to the lack of a lead function or lag function. For completeness, I have here the SQL code. I know the functionality has been implemented in data.table version 1.9.5. as by the changelog. So is this possible with data.table without doing a lot of merges and without a lag or lead function?

In principle, I am not fully against using merges (aka joins) as long as performance does not suffer. I think this has an easy implementation, but I can't figure out how to "get" the previous end time to be the starting time of my gap table.

For example:

# The numbers represent seconds from 1970-01-01 01:00:01
dat <- structure(
  list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), 
       stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
                           as.POSIXct("2014-01-15 11:00:00"),
                           as.POSIXct("2014-01-16 11:30:00"),
                           as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 12:30:00"),
                           as.POSIXct("2014-01-15 13:30:00")
                           ),
                         class = c("POSIXct", "POSIXt"), tzone = ""),
       etime = structure(c(as.POSIXct("2014-01-15 10:30:00"),
                           as.POSIXct("2014-01-15 12:00:00"),
                           as.POSIXct("2014-01-16 13:00:00"),
                           as.POSIXct("2014-01-15 11:00:00"),
                           as.POSIXct("2014-01-15 12:45:00"),
                           as.POSIXct("2014-01-15 14:30:00")
                           ), 
                         class = c("POSIXct", "POSIXt"), tzone = "")
  ),
  .Names = c("ID", "stime", "etime"),
  sorted = c("ID", "stime", "etime"),
  class = c("data.table", "data.frame"),
  row.names = c(NA,-6L)
)

dat <- data.table(dat)

This results in:

ID               stime               etime
1  2014-01-15 10:30:00 2014-01-15 11:00:00
1  2014-01-15 12:00:00 2014-01-16 11:30:00
2  2014-01-15 11:00:00 2014-01-15 12:30:00
2  2014-01-15 12:45:00 2014-01-15 13:30:00

Notice: the gaps are reported evenly across days.

like image 553
Snowflake Avatar asked Jun 04 '15 21:06

Snowflake


2 Answers

A variation on David's answer, likely a little less efficient, but simpler to type out:

setkey(dat, stime)[, .(stime=etime[-.N], etime=stime[-1]), by=ID]

Produces:

   ID               stime               etime
1:  1 2014-01-15 10:30:00 2014-01-15 11:00:00
2:  1 2014-01-15 12:00:00 2014-01-16 11:30:00
3:  2 2014-01-15 11:00:00 2014-01-15 12:30:00
4:  2 2014-01-15 12:45:00 2014-01-15 13:30:00

setkey is just to make sure table is sorted by time.

like image 144
BrodieG Avatar answered Oct 05 '22 16:10

BrodieG


If I'm not missing something, you are missing a row in your desired output, so here's my attempt using shift from the devel version as you mentioned.

library(data.table) ## v >= 1.9.5
indx <- dat[, .I[-.N], by = ID]$V1
dat[, .(ID, stimes = etime, etime = shift(stime, type = "lead"))][indx]
res
#    ID               stime               etime
# 1:  1 2014-01-15 10:30:00 2014-01-15 11:00:00
# 2:  1 2014-01-15 12:00:00 2014-01-16 11:30:00
# 3:  2 2014-01-15 11:00:00 2014-01-15 12:30:00
# 4:  2 2014-01-15 12:45:00 2014-01-15 13:30:00
like image 26
David Arenburg Avatar answered Oct 05 '22 16:10

David Arenburg