Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find overlapping between date and time?

Tags:

r

I have a data frame with products and intervals:

df <- data.frame(Product=c("x","y","z", "x"), 
      Start_Date=c("1/1/2015 08:00", "3/1/2015 10:00", "4/1/2015 09:00", "4/1/2015 12:34"), 
      End_Date=c("2/1/2015 09:00","5/1/2015 12:00","5/1/2015 13:00", "7/1/2015 11:23"))
  Product     Start_Date       End_Date
1       x 1/1/2015 08:00 2/1/2015 09:00
2       y 3/1/2015 10:00 5/1/2015 12:00
3       z 4/1/2015 09:00 5/1/2015 13:00
4       x 4/1/2015 12:34 7/1/2015 11:23

I want to find the overlaps among the dates and times for the products and split them into a disjoined set. I would like to keep track of which products occurred in each interval.

I would like the final result to be similar to this:

df2 <- data.frame(Product=c("x","y", "y/z", "y/z/x", "z/x", "x"), 
     Start_Date=c("1/1/2015 08:00", "3/1/2015 10:00", "4/1/2015 09:00", "4/1/2015 12:34","5/1/2015 
                  12:00", "5/1/2015 13:00"), 
    End_Date=c("2/1/2015 09:00", "4/1/2015 09:00", "4/1/2015 12:34", "5/1/2015 12:00", "5/1/2015 
                  13:00", "7/1/2015 11:23"))
  Product     Start_Date       End_Date
1       x 1/1/2015 08:00 2/1/2015 09:00
2       y 3/1/2015 10:00 4/1/2015 09:00
3     y/z 4/1/2015 09:00 4/1/2015 12:34
4   y/z/x 4/1/2015 12:34 5/1/2015 12:00
5     z/x 5/1/2015 12:00 5/1/2015 13:00
6       x 5/1/2015 13:00 7/1/2015 11:23
like image 611
nls1200 Avatar asked Mar 01 '23 15:03

nls1200


1 Answers

I'm going to imagine that this problem is substantially more challenging than you're letting on. These interval overlap problems are well handled by the IRanges package.

First, let's convert the dates to integers for fast processing:

df[,-1] <- lapply(df[,-1],function(x) as.integer(as.POSIXct(x,"%m/%d/%Y %H:%M",tz = "America/New_York")))
df
  Product Start_Date   End_Date
1       x 1420099200 1422781200
2       y 1425204000 1430481600
3       z 1427878800 1430485200
4       x 1427891640 1435749780

Now we can convert to IRanges:

library(IRanges)
df.ranges <- IRanges(df$Start_Date,end = df$End_Date,names = df$Product)
df.ranges
IRanges object with 4 ranges and 0 metadata columns:
         start        end     width
     <integer>  <integer> <integer>
  x 1420099200 1422781200   2682001
  y 1425204000 1430481600   5277601
  z 1427878800 1430485200   2606401
  x 1427891640 1435749780   7858141

Now we can use disjoin to split up the ranges into sets of overlapping intervals:

df.disjoin <- disjoin(df.ranges)
df.disjoin
IRanges object with 6 ranges and 0 metadata columns:
           start        end     width
       <integer>  <integer> <integer>
  [1] 1420099200 1422781200   2682001
  [2] 1425204000 1427878799   2674800
  [3] 1427878800 1427891639     12840
  [4] 1427891640 1430481600   2589961
  [5] 1430481601 1430485200      3600
  [6] 1430485201 1435749780   5264580

Now we can find the overlaps between the disjoined intervals and the original:

hits <- findOverlaps(df.disjoin,df.ranges)
hits
Hits object with 10 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         2           2
   [3]         3           2
   [4]         3           3
   [5]         4           2
   [6]         4           3
   [7]         4           4
   [8]         5           3
   [9]         5           4
  [10]         6           4
  -------
  queryLength: 6 / subjectLength: 4

From here we can obtain the desired result:

library(tidyverse)
as_tibble(hits) %>%
   mutate(Product = names(df.ranges)[subjectHits]) %>%
   group_by(queryHits) %>%
   summarise(Product = paste(Product,collapse = "/")) -> result
result
# A tibble: 6 x 2
  queryHits Product
      <int> <chr>  
1         1 x      
2         2 y      
3         3 y/z    
4         4 y/z/x  
5         5 z/x    
6         6 x   

Now we can bind the results together and format back to a recognizable date:

bind_cols(as.data.frame(df.disjoin),result) %>%
  mutate(across(start:end, ~as.POSIXlt(.,origin = "1970-01-01 00:00.00 UTC")))
                start                 end   width queryHits Product
1 2015-01-01 08:00:00 2015-02-01 09:00:00 2682001         1       x
2 2015-03-01 10:00:00 2015-04-01 08:59:59 2671200         2       y
3 2015-04-01 09:00:00 2015-04-01 12:33:59   12840         3     y/z
4 2015-04-01 12:34:00 2015-05-01 12:00:00 2589961         4   y/z/x
5 2015-05-01 12:00:01 2015-05-01 13:00:00    3600         5     z/x
6 2015-05-01 13:00:01 2015-07-01 11:23:00 5264580         6       x

This approach will be very fast for almost as many intervals as you can fit in memory.

like image 128
Ian Campbell Avatar answered Mar 12 '23 20:03

Ian Campbell