I have used combn()
to find the overlap between two dates/times using lubridate
package. But combn()
is too slow to process the large dataset I am working on. I am trying to use comboGeneral()
from RcppAlgos
package but I can't get it to work. Any help would be appreciated. If you know any other package/function I should look at, please let me know too.
get_overlap <- function(.data, .id, .start, .end) {
id <- .data[[.id]]
int <- interval(.data[[.start]], .data[[.end]])
names <- combn(id, 2, FUN = function(.) paste(., collapse = "-"))
setNames(combn(int, 2, function(.) intersect(.[1], .[2])), names)
}
get_overlap(dat, "id", "start", "end")
# a-b a-c a-d a-e b-c b-d b-e c-d c-e d-e
# 49 1 4 17 23 14 18 NA 2 NA
Here is my failed attempt using comboGeneral()
.
comboGeneral(dat$int, 2, FUN = function(.) intersect(.[1], .[2]))
# Output:
# [[1]]
# numeric(0)
#
# [[2]]
# numeric(0)
#
# [[3]]
# numeric(0)
# <omitted>
Here is the dataset:
dat <- structure(list(id = c("a", "b", "c", "d", "e"), start = structure(c(1623903457.7771,
1623903447.7771, 1623903505.7771, 1623903406.7771, 1623903489.7771
), class = c("POSIXct", "POSIXt")), end = structure(c(1623903506.7771,
1623903528.7771, 1623903543.7771, 1623903461.7771, 1623903507.7771
), class = c("POSIXct", "POSIXt"))), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Update:
Thank you for all the great suggestions so far! I did some benchmarking using my inelegantly written functions. If you could help further improve it, that would be great. I will update this again based on feedback.
Note that comboIter
is part of comboIter_vector
in which I included a mechanism for extracting the values from the C++ object
object. I wanted to find out the lean efficiency of comboIter()
.
# Unit: microseconds
# expr min lq mean median uq max neval cld
# combn 36092.801 37000.251 40356.8080 37311.901 38112.1010 226049.201 100 d
# comboGeneral 33744.301 34608.702 36756.3749 35099.851 38738.6010 49378.301 100 c
# comboIter 447.401 568.601 634.2019 580.901 606.0505 5866.501 100 a
# comboIter_vector 38037.201 38823.301 39919.0570 39108.952 39562.5505 49880.101 100 cd
# data.table 7816.001 8007.201 8289.0060 8113.401 8230.5510 15489.201 100 b
# IRanges 6451.001 6806.751 7104.0659 6879.651 6994.9005 14415.301 100 b
Here is the code:
library(lubridate)
library(RcppAlgos)
library(data.table)
library(IRanges)
# combn
get_overlap_combn <- function(.data) {
names <- combn(.data$id, 2, function(x) paste(x, collapse = "-"))
setNames(combn(interval(.data$start, .data$end), 2, function(x) intersect(x[1], x[2])), names)
}
get_overlap_combn(dat)
# comboGeneral
get_overlap_cpp1 <- function(.data) {
names <- unlist(comboGeneral(dat$id, 2,
FUN = function(x) paste(x, collapse = "-")))
int <- interval(.data$start, .data$end)
setNames(unlist(comboGeneral(seq_along(int), 2,
FUN = function(x) intersect(int[x[1]], int[x[2]]))), names)
}
get_overlap_cpp1(dat)
# comboIter
get_overlap_cpp2 <- function(.data) {
int <- interval(.data$start, .data$end)
comboIter(seq_along(int), 2,
FUN = function(x) as.double(intersect(int[x[1]], int[x[2]])))
}
get_overlap_cpp2(dat)
# C++ object <000002c2b172ee90> of class 'ComboFUN' <000002c2b16fcc90>
# comboIter_vector
get_overlap_cpp3 <- function(.data) {
int <- interval(.data$start, .data$end)
obj_name <- comboIter(.data$id, 2,
FUN = function(x) paste(x, collapse = "-"))
obj_int <- comboIter(seq_along(int), 2,
FUN = function(x) as.double(intersect(int[x[1]], int[x[2]])))
obj_length <- obj_int$summary()$totalResults
v <- vector("double", obj_length)
name <- vector("character", obj_length)
i <- 1
while (i <= obj_length) {
v[i] <- obj_int$nextIter()
name[i] <- obj_name$nextIter()
i <- i + 1
}
setNames(v, name)
}
get_overlap_cpp3(dat)
# data.table
get_overlap_dt <- function(.data) {
data <- .data
setDT(data)
setkey(data, start, end)
data <- foverlaps(data, data)[id != i.id]
dup <- duplicated(t(apply(data[, c("id", "i.id")], 1, sort)))
data <-
data[dup
][, `:=`(
overlap = as.double(intersect(interval(start, end), interval(i.start, i.end))),
name = paste(id, i.id, sep = "-")
)]
setNames(data$overlap, data$name)
}
get_overlap_dt(dat)
get_overlap_iranges <- function(.data) {
# setup the IRanges object
ir <- IRanges(as.numeric(.data$start), as.numeric(.data$end), names = .data$id)
# find which ids overlap with each other
ovrlp <- findOverlaps(ir, drop.self = TRUE, drop.redundant = TRUE)
# store id indices for further use
hit1 <- queryHits(ovrlp)
hit2 <- subjectHits(ovrlp)
# width of overlaps between ids
widths <- width(pintersect(ir[hit1], ir[hit2])) - 1
names(widths) <- paste(names(ir)[hit1], names(ir)[hit2], sep = "-")
widths
}
get_overlap_iranges(dat)
1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.
Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.
Algorithm (complement 2 - overlapping test entry intervals) The basic idea is 1) first take input_start to test_start (if both of them are not equal and input_start is min) 2) always take test_start and test_end 3) take test_end to input_end if test_end is less than input end (and end_input and end_test are not equal).
A common type of query that arises when working with intervals is finding which intervals in one set overlap those in another. The simplest approach is to call the findOverlapsfunction on a Rangesor other object with range information (aka "range-based object"). Value For findOverlaps: see selectargument above.
The simplest approach is to call the findOverlapsfunction on a Rangesor other object with range information (aka "range-based object"). Value For findOverlaps: see selectargument above. For countOverlaps: the overlap hit count for each range in queryusing the specified findOverlapsparameters. For RangesListobjects, it returns an IntegerListobject.
The gapbetween 2 ranges is the number of positions that separate them. The gapbetween 2 adjacent ranges is 0. By convention when one range has its start or end strictly inside the other (i.e. non-disjoint ranges), the gapis considered to be -1.
Given a set of time intervals in any order, merge all overlapping intervals into one and output the result which should have only mutually exclusive intervals. Let the intervals be represented as pairs of integers for simplicity. For example, let the given set of intervals be { {1,3}, {2,4}, {5,7}, {6,8} }.
Maybe try data.table
foverlaps
function:
library(data.table)
setDT(dat)
setkey(dat, start, end)
foverlaps(dat, dat)[id != i.id]
Another alternative for working on intervals is the "IRanges" package:
library(IRanges)
# setup the IRanges object
ir = IRanges(as.numeric(dat$start), as.numeric(dat$end), names = dat$id)
# find which ids overlap with each other
ovrlp = findOverlaps(ir, drop.self = TRUE, drop.redundant = TRUE)
# store id indices for further use
hit1 = queryHits(ovrlp)
hit2 = subjectHits(ovrlp)
# width of overlaps between ids
widths = width(pintersect(ir[hit1], ir[hit2])) - 1
# result
data.frame(id1 = names(ir)[hit1], id2 = names(ir)[hit2], widths)
# id1 id2 widths
#1 a d 4
#2 a b 49
#3 a e 17
#4 a c 1
#5 b d 14
#6 b e 18
#7 b c 23
#8 c e 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With