Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determine dynamic window width: efficient rolling count of values which satisfy a condition

I have a data.frame with two columns a and b, where a is sorted. I want to get the rolling average of b, where the window is the range a - 5 to a (i.e. from the current value of a to wherever a - 5 is).

Performing the rolling average with different window widths is trivial using data.table::frollmean() (adaptive = TRUE; "each single observation has own corresponding rolling window width"), so the only problem is calculating those window widths.

So, given the following data.frame, how can I determine the window size for each mean?

set.seed(42)
x <- data.frame(
    a = sort(runif(10, 0, 10)),
    b = 1:10
)
x
#>           a  b
#> 1  1.346666  1
#> 2  2.861395  2
#> 3  5.190959  3
#> 4  6.417455  4
#> 5  6.569923  5
#> 6  7.050648  6
#> 7  7.365883  7
#> 8  8.304476  8
#> 9  9.148060  9
#> 10 9.370754 10

Created on 2020-07-03 by the reprex package (v0.3.0)

If I were to put the window size as a new column n, I'd expect the result to be

#>           a  b n
#> 1  1.346666  1 1
#> 2  2.861395  2 2
#> 3  5.190959  3 3
#> 4  6.417455  4 3
#> 5  6.569923  5 4
#> 6  7.050648  6 5
#> 7  7.365883  7 6
#> 8  8.304476  8 6
#> 9  9.148060  9 7
#> 10 9.370754 10 8

So, for example, there are two values between a[2] = 2.86 and 2.86 - 5 (including itself), and there are six values between a[8] = 8.30 and 8.30 - 5.

I've managed to do this using outer:

suppressPackageStartupMessages({
    library(magrittr)
    library(data.table)
})

f <- function(x, y) {
    return(y %between% list(x - 5, x))
}

outer(x$a, x$a, f) %>% rowSums()
#>  [1] 1 2 3 3 4 5 6 6 7 8

However, my real case has easily 5000 rows, and this method gets quite slow (takes around 10 seconds). One problem I see is that it compares every value of a to every other value of a, so has to perform some 25,000,000 comparisons. However, I know a is sorted, so if we find a stretch of TRUE results in the comparison and then a FALSE, we know all subsequent results for the current value of a will also be FALSE (that would mean we were in the allowable range and then moved past the highest allowable value of a, so everything else will also be rejected).

So, is there a better, faster way of doing this?

like image 493
Wasabi Avatar asked Jan 24 '23 22:01

Wasabi


1 Answers

Because it seems that you will load data.table anyway (for frollmean), you may coerce your data.frame to data.table, and add the new column by reference.

findInterval is used to find the index of each subtracted value among the original values. This index is then subtracted from the original index, obtained by .I or seq_along, to get the window size.

setDT(x)
x[ , n := .I - findInterval(a - 5, a)]

# x
#            a  b n
#  1: 1.346666  1 1
#  2: 2.861395  2 2
#  3: 5.190959  3 3
#  4: 6.417455  4 3
#  5: 6.569923  5 4
#  6: 7.050648  6 5
#  7: 7.365883  7 6
#  8: 8.304476  8 6
#  9: 9.148060  9 7
# 10: 9.370754 10 8

Similar with base:

x$n = seq_along(x$a) - findInterval(x$a - 5, x$a)
like image 197
Henrik Avatar answered Feb 13 '23 04:02

Henrik