I have a data.frame with two columns a
and b
, where a
is sorted. I want to get the rolling average of b
, where the window is the range a - 5
to a
(i.e. from the current value of a
to wherever a - 5
is).
Performing the rolling average with different window widths is trivial using data.table::frollmean()
(adaptive = TRUE
; "each single observation has own corresponding rolling window width"), so the only problem is calculating those window widths.
So, given the following data.frame, how can I determine the window size for each mean?
set.seed(42)
x <- data.frame(
a = sort(runif(10, 0, 10)),
b = 1:10
)
x
#> a b
#> 1 1.346666 1
#> 2 2.861395 2
#> 3 5.190959 3
#> 4 6.417455 4
#> 5 6.569923 5
#> 6 7.050648 6
#> 7 7.365883 7
#> 8 8.304476 8
#> 9 9.148060 9
#> 10 9.370754 10
Created on 2020-07-03 by the reprex package (v0.3.0)
If I were to put the window size as a new column n
, I'd expect the result to be
#> a b n
#> 1 1.346666 1 1
#> 2 2.861395 2 2
#> 3 5.190959 3 3
#> 4 6.417455 4 3
#> 5 6.569923 5 4
#> 6 7.050648 6 5
#> 7 7.365883 7 6
#> 8 8.304476 8 6
#> 9 9.148060 9 7
#> 10 9.370754 10 8
So, for example, there are two values between a[2] = 2.86
and 2.86 - 5
(including itself), and there are six values between a[8] = 8.30
and 8.30 - 5
.
I've managed to do this using outer
:
suppressPackageStartupMessages({
library(magrittr)
library(data.table)
})
f <- function(x, y) {
return(y %between% list(x - 5, x))
}
outer(x$a, x$a, f) %>% rowSums()
#> [1] 1 2 3 3 4 5 6 6 7 8
However, my real case has easily 5000 rows, and this method gets quite slow (takes around 10 seconds). One problem I see is that it compares every value of a
to every other value of a
, so has to perform some 25,000,000 comparisons. However, I know a
is sorted, so if we find a stretch of TRUE
results in the comparison and then a FALSE
, we know all subsequent results for the current value of a
will also be FALSE
(that would mean we were in the allowable range and then moved past the highest allowable value of a
, so everything else will also be rejected).
So, is there a better, faster way of doing this?
Because it seems that you will load data.table
anyway (for frollmean
), you may coerce your data.frame
to data.table
, and add the new column by reference.
findInterval
is used to find the index of each subtracted value among the original values. This index is then subtracted from the original index, obtained by .I
or seq_along
, to get the window size.
setDT(x)
x[ , n := .I - findInterval(a - 5, a)]
# x
# a b n
# 1: 1.346666 1 1
# 2: 2.861395 2 2
# 3: 5.190959 3 3
# 4: 6.417455 4 3
# 5: 6.569923 5 4
# 6: 7.050648 6 5
# 7: 7.365883 7 6
# 8: 8.304476 8 6
# 9: 9.148060 9 7
# 10: 9.370754 10 8
Similar with base
:
x$n = seq_along(x$a) - findInterval(x$a - 5, x$a)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With