I have a data frame with products and intervals:
df <- data.frame(Product=c("x","y","z", "x"),
Start_Date=c("1/1/2015 08:00", "3/1/2015 10:00", "4/1/2015 09:00", "4/1/2015 12:34"),
End_Date=c("2/1/2015 09:00","5/1/2015 12:00","5/1/2015 13:00", "7/1/2015 11:23"))
Product Start_Date End_Date
1 x 1/1/2015 08:00 2/1/2015 09:00
2 y 3/1/2015 10:00 5/1/2015 12:00
3 z 4/1/2015 09:00 5/1/2015 13:00
4 x 4/1/2015 12:34 7/1/2015 11:23
I want to find the overlaps among the dates and times for the products and split them into a disjoined set. I would like to keep track of which products occurred in each interval.
I would like the final result to be similar to this:
df2 <- data.frame(Product=c("x","y", "y/z", "y/z/x", "z/x", "x"),
Start_Date=c("1/1/2015 08:00", "3/1/2015 10:00", "4/1/2015 09:00", "4/1/2015 12:34","5/1/2015
12:00", "5/1/2015 13:00"),
End_Date=c("2/1/2015 09:00", "4/1/2015 09:00", "4/1/2015 12:34", "5/1/2015 12:00", "5/1/2015
13:00", "7/1/2015 11:23"))
Product Start_Date End_Date
1 x 1/1/2015 08:00 2/1/2015 09:00
2 y 3/1/2015 10:00 4/1/2015 09:00
3 y/z 4/1/2015 09:00 4/1/2015 12:34
4 y/z/x 4/1/2015 12:34 5/1/2015 12:00
5 z/x 5/1/2015 12:00 5/1/2015 13:00
6 x 5/1/2015 13:00 7/1/2015 11:23
I'm going to imagine that this problem is substantially more challenging than you're letting on. These interval overlap problems are well handled by the IRanges
package.
First, let's convert the dates to integers for fast processing:
df[,-1] <- lapply(df[,-1],function(x) as.integer(as.POSIXct(x,"%m/%d/%Y %H:%M",tz = "America/New_York")))
df
Product Start_Date End_Date
1 x 1420099200 1422781200
2 y 1425204000 1430481600
3 z 1427878800 1430485200
4 x 1427891640 1435749780
Now we can convert to IRanges:
library(IRanges)
df.ranges <- IRanges(df$Start_Date,end = df$End_Date,names = df$Product)
df.ranges
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
x 1420099200 1422781200 2682001
y 1425204000 1430481600 5277601
z 1427878800 1430485200 2606401
x 1427891640 1435749780 7858141
Now we can use disjoin to split up the ranges into sets of overlapping intervals:
df.disjoin <- disjoin(df.ranges)
df.disjoin
IRanges object with 6 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1420099200 1422781200 2682001
[2] 1425204000 1427878799 2674800
[3] 1427878800 1427891639 12840
[4] 1427891640 1430481600 2589961
[5] 1430481601 1430485200 3600
[6] 1430485201 1435749780 5264580
Now we can find the overlaps between the disjoined intervals and the original:
hits <- findOverlaps(df.disjoin,df.ranges)
hits
Hits object with 10 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 2 2
[3] 3 2
[4] 3 3
[5] 4 2
[6] 4 3
[7] 4 4
[8] 5 3
[9] 5 4
[10] 6 4
-------
queryLength: 6 / subjectLength: 4
From here we can obtain the desired result:
library(tidyverse)
as_tibble(hits) %>%
mutate(Product = names(df.ranges)[subjectHits]) %>%
group_by(queryHits) %>%
summarise(Product = paste(Product,collapse = "/")) -> result
result
# A tibble: 6 x 2
queryHits Product
<int> <chr>
1 1 x
2 2 y
3 3 y/z
4 4 y/z/x
5 5 z/x
6 6 x
Now we can bind the results together and format back to a recognizable date:
bind_cols(as.data.frame(df.disjoin),result) %>%
mutate(across(start:end, ~as.POSIXlt(.,origin = "1970-01-01 00:00.00 UTC")))
start end width queryHits Product
1 2015-01-01 08:00:00 2015-02-01 09:00:00 2682001 1 x
2 2015-03-01 10:00:00 2015-04-01 08:59:59 2671200 2 y
3 2015-04-01 09:00:00 2015-04-01 12:33:59 12840 3 y/z
4 2015-04-01 12:34:00 2015-05-01 12:00:00 2589961 4 y/z/x
5 2015-05-01 12:00:01 2015-05-01 13:00:00 3600 5 z/x
6 2015-05-01 13:00:01 2015-07-01 11:23:00 5264580 6 x
This approach will be very fast for almost as many intervals as you can fit in memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With