I was wondering whether the island and gaps problems can be solved in R efficiently, similar to SQL. I have the following data available, if we examine one ID:
ID StartDate StartTime EndDate EndTime
1 19-05-2014 19:00 19-05-2014 20:00
1 19-05-2014 19:30 19-05-2014 23:30
1 19-05-2014 16:00 19-05-2014 18:00
1 20-05-2014 20:00 20-05-2014 20:30
Notice that the first two rows overlap, what I would like to do, is merge the overlapping rows, resulting:
ID StartDate StartTime EndDate EndTime
1 19-05-2014 19:00 19-05-2014 23:30
1 19-05-2014 16:00 19-05-2014 18:00
1 20-05-2014 20:00 20-05-2014 20:30
Is there a way to do this in R?
I am well aware this is done in SQL, but since my data is already in R, I prefer to do this in R. Second, I have some questions regarding the performance of finding gaps and islands, I know that SQL is very fast in doing that, but I wonder whether R is faster due to all the data being in memory.
I would like to use data.table
to do this, but I don't know how.
UPDATE - Response to Arun
I have created the following test case that contains every possible interval orientation.
dat <- structure(
list(ID = c(1L, 1L, 1L, 1L, 1L, 1L),
stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 11:30:00"),
as.POSIXct("2014-01-15 12:00:00")),
class = c("POSIXct", "POSIXt"), tzone = ""),
etime = structure(c(as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 11:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 12:30:00"),
as.POSIXct("2014-01-15 13:00:00")),
class = c("POSIXct", "POSIXt"), tzone = "")
),
.Names = c("ID", "stime", "etime"),
sorted = c("ID", "stime", "etime"),
class = c("data.table", "data.frame"),
row.names = c(NA,-6L)
)
I would expect that the interval from 8:30 - 10:00 will be "glued" onto 10:00 - 11:00, but that was not the case. The result was:
idx ID stime etime
1: 4 1 2014-01-15 08:00:00 2014-01-15 10:00:00
2: 3 1 2014-01-15 10:00:00 2014-01-15 11:00:00
3: 6 1 2014-01-15 11:30:00 2014-01-15 13:00:00
The following data set provides a more thorough testing:
# The numbers represent seconds from 1970-01-01 01:00:01
dat <- structure(
list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 11:30:00"),
as.POSIXct("2014-01-15 12:00:00"),
as.POSIXct("2014-01-15 07:30:00"),
as.POSIXct("2014-01-15 08:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 10:00:00")
),
class = c("POSIXct", "POSIXt"), tzone = ""),
etime = structure(c(as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 11:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 12:30:00"),
as.POSIXct("2014-01-15 13:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 10:30:00"),
as.POSIXct("2014-01-15 11:00:00")
),
class = c("POSIXct", "POSIXt"), tzone = "")
),
.Names = c("ID", "stime", "etime"),
sorted = c("ID", "stime", "etime"),
class = c("data.table", "data.frame"),
row.names = c(NA,-6L)
)
So our result is:
idx ID stime etime
1: 4 1 2014-01-15 08:00:00 2014-01-15 10:00:00
2: 3 1 2014-01-15 10:00:00 2014-01-15 11:00:00
3: 6 1 2014-01-15 11:30:00 2014-01-15 13:00:00
4: 12 2 2014-01-15 07:30:00 2014-01-15 09:30:00
5: 13 2 2014-01-15 09:00:00 2014-01-15 11:00:00
Now for respondent with ID=2, we see that the intervals are overlapping, but not reported as one interval. The correct solution would be:
idx ID stime etime
1: ? 1 2014-01-15 08:00:00 2014-01-15 11:00:00
3: ? 1 2014-01-15 11:30:00 2014-01-15 13:00:00
4: ?? 2 2014-01-15 07:30:00 2014-01-15 11:00:00
Update - Benchmarks and testing and large datasets
I have the following dataset with about 1000 users, each having 500 durations, giving 0.5 million rows. You can download the dataset at my Google Drive, including the solution in Google Drive.
SQL Server 2014 on a laptop of 8GB RAM, 64-bit, i5-4210U CPU @ 1.70Ghz - 2.39Ghz takes about 5 seconds to do this using the solution provided by Itzik Ben-Gan in SQL. The 5 seconds are excluding the process of creating a function. In addition, no indices are created for any table whatsoever.
PS: I use library(lubridate);
Here's a very simple idea. Order by start time, then find cumulative max of end time. Once you've done that, the overlap groups are simply those where the next start time is still less than or equal to current cumulative max end time (all done by ID):
setorder(dat, ID, stime) # ordering by ID is unnecessary, it's just prettier
dat[, etime.max := as.POSIXct(cummax(as.numeric(etime)), origin = '1970-01-01'), by = ID]
# find the grouping of intervals (1:.N hack is to avoid warnings when .N=1)
dat[, grp := cumsum(c(FALSE, stime[2:.N] > etime.max[1:(.N-1)]))[1:.N], by = ID]
dat[, .(stime = min(stime), etime = max(etime)), by = .(ID, grp)][, grp := NULL][]
# ID stime etime
#1: 1 2014-01-15 08:00:00 2014-01-15 11:00:00
#2: 1 2014-01-15 11:30:00 2014-01-15 13:00:00
#3: 2 2014-01-15 07:30:00 2014-01-15 11:00:00
Since this doesn't need to find all possible overlaps, it's very fast. On a simulated data set that roughly matches OP's description it's instantaneous for me (< 0.2s).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With