Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently find the overlap between two time intervals in R

Tags:

r

lubridate

combn

I have used combn() to find the overlap between two dates/times using lubridate package. But combn() is too slow to process the large dataset I am working on. I am trying to use comboGeneral() from RcppAlgos package but I can't get it to work. Any help would be appreciated. If you know any other package/function I should look at, please let me know too.

get_overlap <- function(.data, .id, .start, .end) {
  id <- .data[[.id]]
  int <- interval(.data[[.start]], .data[[.end]])
  names <- combn(id, 2, FUN = function(.) paste(., collapse = "-"))
  setNames(combn(int, 2, function(.) intersect(.[1], .[2])), names)
}

get_overlap(dat, "id", "start", "end")
# a-b a-c a-d a-e b-c b-d b-e c-d c-e d-e 
#  49   1   4  17  23  14  18  NA   2  NA 

Here is my failed attempt using comboGeneral().

comboGeneral(dat$int, 2, FUN = function(.) intersect(.[1], .[2]))

# Output:
# [[1]]
# numeric(0)
# 
# [[2]]
# numeric(0)
# 
# [[3]]
# numeric(0)
# <omitted>

Here is the dataset:

dat <- structure(list(id = c("a", "b", "c", "d", "e"), start = structure(c(1623903457.7771, 
1623903447.7771, 1623903505.7771, 1623903406.7771, 1623903489.7771
), class = c("POSIXct", "POSIXt")), end = structure(c(1623903506.7771, 
1623903528.7771, 1623903543.7771, 1623903461.7771, 1623903507.7771
), class = c("POSIXct", "POSIXt"))), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

Update:

Thank you for all the great suggestions so far! I did some benchmarking using my inelegantly written functions. If you could help further improve it, that would be great. I will update this again based on feedback.

Note that comboIter is part of comboIter_vector in which I included a mechanism for extracting the values from the C++ object object. I wanted to find out the lean efficiency of comboIter().

# Unit: microseconds
#              expr       min        lq       mean    median         uq        max neval  cld
#             combn 36092.801 37000.251 40356.8080 37311.901 38112.1010 226049.201   100    d
#      comboGeneral 33744.301 34608.702 36756.3749 35099.851 38738.6010  49378.301   100   c 
#         comboIter   447.401   568.601   634.2019   580.901   606.0505   5866.501   100 a   
#  comboIter_vector 38037.201 38823.301 39919.0570 39108.952 39562.5505  49880.101   100   cd
#        data.table  7816.001  8007.201  8289.0060  8113.401  8230.5510  15489.201   100  b  
#           IRanges  6451.001  6806.751  7104.0659  6879.651  6994.9005  14415.301   100  b  

Here is the code:

library(lubridate)
library(RcppAlgos)
library(data.table)
library(IRanges)

# combn
get_overlap_combn <- function(.data) {
  names <- combn(.data$id, 2, function(x) paste(x, collapse = "-"))
  setNames(combn(interval(.data$start, .data$end), 2, function(x) intersect(x[1], x[2])), names)
}

get_overlap_combn(dat)


# comboGeneral
get_overlap_cpp1 <- function(.data) {
  names <- unlist(comboGeneral(dat$id, 2,
                               FUN = function(x) paste(x, collapse = "-")))
  int <- interval(.data$start, .data$end)
  setNames(unlist(comboGeneral(seq_along(int), 2,
                               FUN = function(x) intersect(int[x[1]], int[x[2]]))), names)
}

get_overlap_cpp1(dat)


# comboIter
get_overlap_cpp2 <- function(.data) {
  int <- interval(.data$start, .data$end)
  comboIter(seq_along(int), 2,
            FUN = function(x) as.double(intersect(int[x[1]], int[x[2]])))
}

get_overlap_cpp2(dat)
# C++ object <000002c2b172ee90> of class 'ComboFUN' <000002c2b16fcc90>

# comboIter_vector
get_overlap_cpp3 <- function(.data) {
  int <- interval(.data$start, .data$end)
  obj_name <- comboIter(.data$id, 2,
                       FUN = function(x) paste(x, collapse = "-"))
  obj_int <- comboIter(seq_along(int), 2,
                       FUN = function(x) as.double(intersect(int[x[1]], int[x[2]])))

  obj_length <- obj_int$summary()$totalResults
  v <- vector("double", obj_length)
  name <- vector("character", obj_length)
  i <- 1

  while (i <= obj_length) {
    v[i] <- obj_int$nextIter()
    name[i] <- obj_name$nextIter()
    i <- i + 1
  }

  setNames(v, name)
}

get_overlap_cpp3(dat)


# data.table
get_overlap_dt <- function(.data) {
  data <- .data
  setDT(data)
  setkey(data, start, end)

  data <- foverlaps(data, data)[id != i.id]
  dup <- duplicated(t(apply(data[, c("id", "i.id")], 1, sort)))

  data <-
    data[dup
    ][, `:=`(
      overlap = as.double(intersect(interval(start, end), interval(i.start, i.end))),
      name = paste(id, i.id, sep = "-")
    )]
  setNames(data$overlap, data$name)
}

get_overlap_dt(dat)

get_overlap_iranges <- function(.data) {
  # setup the IRanges object
  ir <- IRanges(as.numeric(.data$start), as.numeric(.data$end), names = .data$id)
  
  # find which ids overlap with each other
  ovrlp <- findOverlaps(ir, drop.self = TRUE, drop.redundant = TRUE) 
  
  # store id indices for further use    
  hit1 <- queryHits(ovrlp)
  hit2 <- subjectHits(ovrlp)
  
  # width of overlaps between ids    
  widths <- width(pintersect(ir[hit1], ir[hit2])) - 1
  names(widths) <- paste(names(ir)[hit1], names(ir)[hit2], sep = "-")
  
  widths
}

get_overlap_iranges(dat)
like image 217
Zaw Avatar asked Jun 17 '21 04:06

Zaw


People also ask

How to check if 2 intervals overlap?

1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.

How do you find overlap?

Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.

How do you find overlapping time intervals in Python?

Algorithm (complement 2 - overlapping test entry intervals) The basic idea is 1) first take input_start to test_start (if both of them are not equal and input_start is min) 2) always take test_start and test_end 3) take test_end to input_end if test_end is less than input end (and end_input and end_test are not equal).

How do I find the overlap between two intervals?

A common type of query that arises when working with intervals is finding which intervals in one set overlap those in another. The simplest approach is to call the findOverlapsfunction on a Rangesor other object with range information (aka "range-based object"). Value For findOverlaps: see selectargument above.

How do I find the overlap of a range in SQL?

The simplest approach is to call the findOverlapsfunction on a Rangesor other object with range information (aka "range-based object"). Value For findOverlaps: see selectargument above. For countOverlaps: the overlap hit count for each range in queryusing the specified findOverlapsparameters. For RangesListobjects, it returns an IntegerListobject.

What is the gap between two ranges in a graph?

The gapbetween 2 ranges is the number of positions that separate them. The gapbetween 2 adjacent ranges is 0. By convention when one range has its start or end strictly inside the other (i.e. non-disjoint ranges), the gapis considered to be -1.

How do you find mutually exclusive time intervals?

Given a set of time intervals in any order, merge all overlapping intervals into one and output the result which should have only mutually exclusive intervals. Let the intervals be represented as pairs of integers for simplicity. For example, let the given set of intervals be { {1,3}, {2,4}, {5,7}, {6,8} }.


2 Answers

Maybe try data.table foverlaps function:

library(data.table)
setDT(dat)
setkey(dat, start, end)
foverlaps(dat, dat)[id != i.id]
like image 119
det Avatar answered Oct 20 '22 21:10

det


Another alternative for working on intervals is the "IRanges" package:

library(IRanges)

# setup the IRanges object
ir = IRanges(as.numeric(dat$start), as.numeric(dat$end), names = dat$id)

# find which ids overlap with each other
ovrlp = findOverlaps(ir, drop.self = TRUE, drop.redundant = TRUE) 

# store id indices for further use    
hit1 = queryHits(ovrlp)
hit2 = subjectHits(ovrlp)

# width of overlaps between ids    
widths = width(pintersect(ir[hit1], ir[hit2])) - 1

# result    
data.frame(id1 = names(ir)[hit1], id2 = names(ir)[hit2], widths)
#  id1 id2 widths
#1   a   d      4
#2   a   b     49
#3   a   e     17
#4   a   c      1
#5   b   d     14
#6   b   e     18
#7   b   c     23
#8   c   e      2
like image 30
alexis_laz Avatar answered Oct 20 '22 21:10

alexis_laz