# Determine dynamic window width: efficient rolling count of values which satisfy a condition

### Tags:

#### rolling-computation

I have a data.frame with two columns `a` and `b`, where `a` is sorted. I want to get the rolling average of `b`, where the window is the range `a - 5` to `a` (i.e. from the current value of `a` to wherever `a - 5` is).

Performing the rolling average with different window widths is trivial using `data.table::frollmean()` (`adaptive = TRUE`; "each single observation has own corresponding rolling window width"), so the only problem is calculating those window widths.

So, given the following data.frame, how can I determine the window size for each mean?

``````set.seed(42)
x <- data.frame(
a = sort(runif(10, 0, 10)),
b = 1:10
)
x
#>           a  b
#> 1  1.346666  1
#> 2  2.861395  2
#> 3  5.190959  3
#> 4  6.417455  4
#> 5  6.569923  5
#> 6  7.050648  6
#> 7  7.365883  7
#> 8  8.304476  8
#> 9  9.148060  9
#> 10 9.370754 10
``````

Created on 2020-07-03 by the reprex package (v0.3.0)

If I were to put the window size as a new column `n`, I'd expect the result to be

``````#>           a  b n
#> 1  1.346666  1 1
#> 2  2.861395  2 2
#> 3  5.190959  3 3
#> 4  6.417455  4 3
#> 5  6.569923  5 4
#> 6  7.050648  6 5
#> 7  7.365883  7 6
#> 8  8.304476  8 6
#> 9  9.148060  9 7
#> 10 9.370754 10 8
``````

So, for example, there are two values between `a[2] = 2.86` and `2.86 - 5` (including itself), and there are six values between `a[8] = 8.30` and `8.30 - 5`.

I've managed to do this using `outer`:

``````suppressPackageStartupMessages({
library(magrittr)
library(data.table)
})

f <- function(x, y) {
return(y %between% list(x - 5, x))
}

outer(x\$a, x\$a, f) %>% rowSums()
#>  [1] 1 2 3 3 4 5 6 6 7 8
``````

However, my real case has easily 5000 rows, and this method gets quite slow (takes around 10 seconds). One problem I see is that it compares every value of `a` to every other value of `a`, so has to perform some 25,000,000 comparisons. However, I know `a` is sorted, so if we find a stretch of `TRUE` results in the comparison and then a `FALSE`, we know all subsequent results for the current value of `a` will also be `FALSE` (that would mean we were in the allowable range and then moved past the highest allowable value of `a`, so everything else will also be rejected).

So, is there a better, faster way of doing this?

493

#### Wasabi

Because it seems that you will load `data.table` anyway (for `frollmean`), you may coerce your `data.frame` to `data.table`, and add the new column by reference.

`findInterval` is used to find the index of each subtracted value among the original values. This index is then subtracted from the original index, obtained by `.I` or `seq_along`, to get the window size.

``````setDT(x)
x[ , n := .I - findInterval(a - 5, a)]

# x
#            a  b n
#  1: 1.346666  1 1
#  2: 2.861395  2 2
#  3: 5.190959  3 3
#  4: 6.417455  4 3
#  5: 6.569923  5 4
#  6: 7.050648  6 5
#  7: 7.365883  7 6
#  8: 8.304476  8 6
#  9: 9.148060  9 7
# 10: 9.370754 10 8
``````

Similar with `base`:

``````x\$n = seq_along(x\$a) - findInterval(x\$a - 5, x\$a)
``````
197