
Efficient way to filter one data frame by ranges in another

Tags: r, dplyr

Let's say I have a data frame containing a bunch of data and a date/time column indicating when each data point was collected. I have another data frame that lists time spans, where a "Start" column indicates the date/time when each span starts and an "End" column indicates the date/time when each span ends. I'd like to keep only the rows of the main data frame whose date/time falls within at least one of those spans.

I've created a dummy example below using simplified data:

main_data = data.frame(Day=c(1:30))

spans_to_filter = 
    data.frame(Span_number = c(1:6),
               Start = c(2,7,1,15,12,23),
               End = c(5,10,4,18,15,26))

I toyed around with a few ways of solving this problem and ended up with the following solution:

require(dplyr)    
filtered.main_data =
    main_data %>% 
    rowwise() %>% 
    mutate(present = any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)) %>% 
    filter(present) %>% 
    data.frame()

This works perfectly fine, but I noticed it can take a while to process when I have a lot of data (I assume because I'm performing a row-wise comparison). I'm still learning the ins and outs of R, and I was wondering if there is a more efficient way of performing this operation, preferably using dplyr/tidyr?

asked Apr 06 '16 by Brain_Food

2 Answers

Non-equi joins were implemented in the data.table package starting from v1.9.8. Building on this, I've created a wrapper function inrange() for exactly this kind of operation, where the task is to find whether a point lies in any of the provided intervals, returning TRUE if it does and FALSE otherwise.

require(data.table) # v>=1.9.8
setDT(main_data)[Day %inrange% spans_to_filter[, 2:3]] # inclusive bounds
#     Day
#  1:   1
#  2:   2
#  3:   3
#  4:   4
#  5:   5
#  6:   7
#  7:   8
#  8:   9
#  9:  10
# 10:  12
# 11:  13
# 12:  14
# 13:  15
# 14:  16
# 15:  17
# 16:  18
# 17:  23
# 18:  24
# 19:  25
# 20:  26

See ?inrange for more.
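
Under the hood, inrange() is built on a non-equi join. For reference, here is a minimal sketch of the equivalent explicit join, assuming data.table >= 1.9.8 (the res name is just for illustration):

require(data.table) # v>=1.9.8
setDT(main_data)
setDT(spans_to_filter)

# For each span, find the Days it covers; x.Day pulls the matched Day values
# from main_data, nomatch = 0L drops non-matches, and unique() removes
# duplicates that arise when spans overlap (e.g. spans 1 and 3 above).
res = unique(main_data[spans_to_filter,
                       .(Day = x.Day),
                       on = .(Day >= Start, Day <= End),
                       nomatch = 0L])
setorder(res, Day)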

answered Oct 18 '22 by Arun

Here's a function you can use in a dplyr pipeline to find dates within any of the given ranges, using dplyr's between function. For each value of Day, mapply runs between on each pair of Start and End dates, and the function uses rowSums to return TRUE if Day falls within at least one of the spans. I'm not sure it's the most efficient approach, but it results in nearly a factor-of-four improvement in speed.

test.overlap = function(vals) {
  rowSums(mapply(function(a,b) between(vals, a, b), 
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% 
  filter(test.overlap(Day))
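
To make the intermediate step concrete, here's a small sketch using the example data above (the m object is just for illustration):

library(dplyr)

# mapply builds a logical matrix: one row per value of Day (here 1:8),
# one column per span in spans_to_filter
m = mapply(function(a, b) between(1:8, a, b),
           spans_to_filter$Start, spans_to_filter$End)

# TRUE for Days that fall inside at least one span
rowSums(m) > 0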

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>% 
  filter(Day %in% filt.vals)
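
The same membership idea carries over to actual Date columns. Here's a minimal sketch with hypothetical Date-valued data frames (spans_dates and main_dates are stand-ins for the example data, not part of the original question):

library(dplyr)

# Hypothetical Date-based versions of the example data frames
spans_dates = data.frame(Start = as.Date(c("2016-01-02", "2016-01-07")),
                         End   = as.Date(c("2016-01-05", "2016-01-10")))
main_dates = data.frame(Day = seq(as.Date("2016-01-01"), as.Date("2016-01-30"), by = "day"))

# Expand each span into the individual dates it covers, then test membership
filt.dates = do.call(c, Map(function(s, e) seq(s, e, by = "day"),
                            spans_dates$Start, spans_dates$End))

main_dates %>% 
  filter(Day %in% filt.dates)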

Now let's compare execution speeds. I've shortened your code to include only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>% 
    filter(Day %in% filt.vals)
  )

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b 
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a   

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day=c(1:100000))

spans_to_filter = 
  data.frame(Span_number = c(1:9),
             Start = c(2,7,1,15,12,23,90,9000,50000),
             End = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>% 
      filter(Day %in% filt.vals)}, 
  times=10
  )

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a 
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a 

answered Oct 18 '22 by eipi10