Let's say I have a data frame containing a bunch of data and a date/time column indicating when each data point was collected. I have another data frame that lists time spans, where a "Start" column indicates the date/time when each span starts and an "End" column indicates the date/time when each span ends.
I've created a dummy example below using simplified data:
main_data = data.frame(Day=c(1:30))
spans_to_filter =
data.frame(Span_number = c(1:6),
Start = c(2,7,1,15,12,23),
End = c(5,10,4,18,15,26))
I toyed around with a few ways of solving this problem and ended up with the following solution:
require(dplyr)
filtered.main_data =
main_data %>%
rowwise() %>%
mutate(present = any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)) %>%
filter(present) %>%
data.frame()
This works perfectly fine, but I noticed it can take a while to process if I have a lot of data (I assume because I'm performing a row-wise comparison). I'm still learning the ins-and-outs of R and I was wondering if there is a more efficient way of performing this operation, preferably using dplyr/tidyr?
In the data.table package starting from v1.9.8, non-equi joins has been implemented. With this, I've created a wrapper function inrange()
for exactly these kind of operations, where the task involves finding if a point lies in any of the intervals provided, and if so return TRUE
, else FALSE
.
require(data.table) # v>=1.9.8
setDT(main_data)[Day %inrange% spans_to_filter[, 2:3]] # inclusive bounds
# Day
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 7
# 7: 8
# 8: 9
# 9: 10
# 10: 12
# 11: 13
# 12: 14
# 13: 15
# 14: 16
# 15: 17
# 16: 18
# 17: 23
# 18: 24
# 19: 25
# 20: 26
See ?inrange
for more.
Here's a function that you can run in dplyr
to find dates within a given range using the between
function (from dplyr
). For each value of Day
, mapply
runs between
on each of the pairs of Start
and End
dates and the function uses rowSums
to return TRUE
if Day
is between at least one of them. I'm not sure if it's the most efficient approach, but it results in nearly a factor of four improvement in speed.
test.overlap = function(vals) {
rowSums(mapply(function(a,b) between(vals, a, b),
spans_to_filter$Start, spans_to_filter$End)) > 0
}
main_data %>%
filter(test.overlap(Day))
If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):
filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
main_data %>%
filter(Day %in% filt.vals)
Now compare execution speeds. I shortened your code to require only the filtering operation:
library(microbenchmark)
microbenchmark(
OP=main_data %>%
rowwise() %>%
filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
eipi10 = main_data %>%
filter(test.overlap(Day)),
eipi10_2 = main_data %>%
filter(Day %in% filt.vals)
)
Unit: microseconds
expr min lq mean median uq max neval cld
OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481 100 c
eipi10 658.941 686.933 782.8840 714.4440 770.679 2474.941 100 b
eipi10_2 579.338 601.355 655.1451 619.2595 672.535 1032.145 100 a
UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between
method, and far greater still for the second method).
main_data = data.frame(Day=c(1:100000))
spans_to_filter =
data.frame(Span_number = c(1:9),
Start = c(2,7,1,15,12,23,90,9000,50000),
End = c(5,10,4,18,15,26,100,9100,50100))
microbenchmark(
OP=main_data %>%
rowwise() %>%
filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
eipi10 = main_data %>%
filter(test.overlap(Day)),
eipi10_2 = {
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
main_data %>%
filter(Day %in% filt.vals)},
times=10
)
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648 10 b
eipi10 24.209111 25.434856 29.526571 26.455813 32.051920 48.277326 10 a
eipi10_2 2.505509 2.618668 4.037414 2.892234 6.222845 8.266612 10 a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With