
Find matching intervals in data frame by range of two column values

Tags: r, dplyr

I have a data frame of time related events.

Here is an example:

Name     Event Order     Sequence     start_event     end_event     duration     Group 
JOHN     1               A               0               19          19           ID1
JOHN     2               A               60              112         52           ID1  
JOHN     3               A               392             429         37           ID1  
JOHN     4               B               282             329         47           ID1
JOHN     5               C               147             226         79           ID1  
JOHN     6               C               566             611         45           ID1  
ADAM     1               A               19              75          56           ID2
ADAM     2               A               384             407         23           ID2  
ADAM     3               B               0               79          79           ID2  
ADAM     4               B               505             586         81           ID2
ADAM     5               C               140             205         65           ID2  
ADAM     6               C               522             599         77           ID2  
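For anyone who wants to run the code in the answer, the example table could be loaded roughly like this. This is only a sketch: the object name `dat` is what the answer assumes, and the column shown as "Event Order" is written `EventOrder` (no space) to match the answer's printed output.

```r
# Rebuild the example table as a data frame called `dat`.
# Assumption: "Event Order" is stored as `EventOrder`.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name EventOrder Sequence start_event end_event duration Group
JOHN 1 A   0  19 19 ID1
JOHN 2 A  60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A  19  75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B   0  79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
")

# Sanity check: duration should equal end_event - start_event in every row
all(dat$duration == dat$end_event - dat$start_event)  # TRUE
```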

There are essentially two different groups, ID1 and ID2. Each group contains 18 different names. Each person appears in 3 different sequences, A-C. They have active time periods during those sequences; I mark the start and end events and calculate the duration.

I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.

Using the example data above, I want to find when John and Adam appear during the same sequence, at the same time. I then want to compare John to the rest of the 17 names in ID1/ID2.

I do not need the exact amount of shared 'active' time; I only want to isolate the rows that overlap.

I'm most comfortable with dplyr, but I can't crack this yet. I looked around and saw some similar examples using adjacency matrices, but those match exact data points. I can't figure out a strategy for a range/interval.

Thank you!

UPDATE: Here is an example of the desired result:

Name     Event Order     Sequence     start_event     end_event     duration     Group
JOHN     3               A               392             429         37           ID1
JOHN     5               C               147             226         79           ID1
JOHN     6               C               566             611         45           ID1
ADAM     2               A               384             407         23           ID2
ADAM     5               C               140             205         65           ID2
ADAM     6               C               522             599         77           ID2

I'm thinking you'd isolate each event row for John, mark its start/end time frame, and then iterate through every other name and event in the data frame to find events that are, first, in the same sequence and, second, overlapping John's benchmarked start/end time frame.

wetcoaster asked Oct 17 '15 22:10



1 Answer

As I understand it, you want to return any row where an event for John in a given sequence overlaps in time with an event for anybody else in the same sequence. To achieve this, you could use split-apply-combine: split by sequence, identify the overlapping rows, and then re-combine:

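# TRUE when [start1, end1] and [start2, end2] share more than a single point
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
# Split by Sequence, find the overlapping rows in each piece, then re-combine
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
  jpos <- which(x$Name == "JOHN")   # rows for John
  njpos <- which(x$Name != "JOHN")  # rows for everybody else
  # over[i, j] is TRUE when John's i-th event overlaps the others' j-th event
  over <- outer(jpos, njpos, function(a, b) {
    overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
  })
  # Keep John's events that overlap anything, plus the events that overlap John
  x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))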
#      Name EventOrder Sequence start_event end_event duration Group
# A.2  JOHN          2        A          60       112       52   ID1
# A.3  JOHN          3        A         392       429       37   ID1
# A.7  ADAM          1        A          19        75       56   ID2
# A.8  ADAM          2        A         384       407       23   ID2
# C.5  JOHN          5        C         147       226       79   ID1
# C.6  JOHN          6        C         566       611       45   ID1
# C.11 ADAM          5        C         140       205       65   ID2
# C.12 ADAM          6        C         522       599       77   ID2

Note that my output includes two rows that are not shown in the question's desired result: John's sequence A event over the time range [60, 112] and Adam's sequence A event over [19, 75]. These two events overlap between 60 and 75.
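To see why that pair is returned, the overlap test can be checked on individual intervals from the example. This is a small sketch using the same `overlap` helper; `overlap_closed` is a hypothetical variant (not part of the code above) shown only to illustrate what changes if intervals that merely touch at an endpoint should also count.

```r
# Two intervals overlap when the earlier of the two ends comes
# strictly after the later of the two starts.
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)

overlap(60, 112, 19, 75)     # John A [60, 112] vs Adam A [19, 75]: 75 > 60
# [1] TRUE
overlap(392, 429, 384, 407)  # John A [392, 429] vs Adam A [384, 407]: 407 > 392
# [1] TRUE
overlap(0, 19, 19, 75)       # John A [0, 19] vs Adam A [19, 75]: 19 > 19 fails
# [1] FALSE

# Hypothetical variant: treat intervals that touch at one point as overlapping
overlap_closed <- function(start1, end1, start2, end2) pmin(end1, end2) >= pmax(start1, start2)
overlap_closed(0, 19, 19, 75)
# [1] TRUE
```

Because `pmin` and `pmax` are vectorised, the same function works on whole columns, which is what lets `outer` evaluate every John/other pair in a single call.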

This could be pretty easily mapped into dplyr language:

library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
  jpos <- which(name == "JOHN")
  njpos <- which(name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
  c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
  group_by(Sequence) %>%
  slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
# 
#     Name EventOrder Sequence start_event end_event duration  Group
#   (fctr)      (int)   (fctr)       (int)     (int)    (int) (fctr)
# 1   JOHN          2        A          60       112       52    ID1
# 2   JOHN          3        A         392       429       37    ID1
# 3   ADAM          1        A          19        75       56    ID2
# 4   ADAM          2        A         384       407       23    ID2
# 5   JOHN          5        C         147       226       79    ID1
# 6   JOHN          6        C         566       611       45    ID1
# 7   ADAM          5        C         140       205       65    ID2
# 8   ADAM          6        C         522       599       77    ID2

If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:

overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2),]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
  }))
}

You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:

apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))
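The list returned by the call above is unnamed, so it can be hard to tell which element belongs to which pair. One possible way to label the results is sketched below; it assumes the example data from the question loaded as `dat` (with the column named `EventOrder`), repeats `pair.overlap` so the block runs on its own, and uses `lapply` over `combn(..., simplify = FALSE)` in place of `apply`.

```r
# Example data from the question (assumed column names)
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name EventOrder Sequence start_event end_event duration Group
JOHN 1 A   0  19 19 ID1
JOHN 2 A  60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A  19  75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B   0  79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
")

overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2),]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
  }))
}

# One character vector of length 2 per pair of users
pairs <- combn(unique(as.character(dat$Name)), 2, simplify = FALSE)
res <- lapply(pairs, function(p) pair.overlap(dat, p[1], p[2]))
# Label each element "USER1-USER2"
names(res) <- vapply(pairs, paste, character(1), collapse = "-")
# Drop pairs with no overlapping rows
res <- Filter(function(d) !is.null(d) && nrow(d) > 0, res)

names(res)
# [1] "JOHN-ADAM"
nrow(res[["JOHN-ADAM"]])
# [1] 8
```

With the full set of names, `res[["JOHN-ADAM"]]` (or any other pair) can then be looked up directly by name.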
josliber answered Oct 11 '22 05:10
