I regularly find rolling things of time series (particularly means), and was surprised to find that rollmean
is notably faster than rollapply
, and that the align = 'right'
methods are faster than the rollmeanr
wrappers.
How have they achieved this speed up? And why does one lose some of it when using the rollmeanr()
wrapper?
Some background: I had been using rollapplyr(x, n, function(X) mean(X))
, however I recently happened upon a few examples using rollmean
. The documents suggest rollapplyr(x, n, mean)
(note without the function
part of the argument) uses rollmean
so I didn't think that there would be much difference in performance, however rbenchmark
revealed notable differences.
require(zoo)
require(rbenchmark)
x <- rnorm(1e4)
r1 <- function() rollapplyr(x, 3, mean) # uses rollmean
r2 <- function() rollapplyr(x, 3, function(x) mean(x))
r3 <- function() rollmean(x, 3, na.pad = TRUE, align = 'right')
r4 <- function() rollmeanr(x, 3, align = "right")
bb <- benchmark(r1(), r2(), r3(), r4(),
columns = c('test', 'elapsed', 'relative'),
replications = 100,
order = 'elapsed')
print(bb)
I was surprised to find that rollmean(x, n, align = 'right')
was notably faster -- and ~40x faster than my rollapply(x, n, function(X) mean(X))
approach.
test elapsed relative
3 r3() 0.74 1.000
4 r4() 0.86 1.162
1 r1() 0.98 1.324
2 r2() 27.53 37.203
The difference seems to get larger as the size of the data-set grows. I changed only the size of x
(to rnorm(1e5)
) in the above code and re-ran the test and there was an even larger difference between the functions.
test elapsed relative
3 r3() 13.33 1.000
4 r4() 17.43 1.308
1 r1() 19.83 1.488
2 r2() 279.47 20.965
and for x <- rnorm(1e6)
test elapsed relative
3 r3() 44.23 1.000
4 r4() 54.30 1.228
1 r1() 65.30 1.476
2 r2() 2473.35 55.920
How have they done this? Also, is this the optimal solution? Sure, this is fast but is there an even faster way to do this?
(Note: in general my time series are almost always xts
objects -- does this matter?)
Computing the rolling mean is faster than computing a general rolling function, because the first one is easier to compute. When computing a general rolling function you have to compute the function on each window again and again, which you don't have to do for mean
, because of the simple identity:
(a2 + a3 + ... + an)/(n-1) = (a1 + a2 + ... + a(n-1))/(n-1) + (an - a1)/(n-1)
and you can see how that's leveraged by looking at getAnywhere(rollmean.zoo)
.
If you want an even faster rolling mean, use runmean
from caTools
, which is implemented in C making it much faster (it also scales a lot better so will get even faster as the size of data increases).
library(microbenchmark)
library(caTools)
library(zoo)
x = rnorm(1e4)
microbenchmark(runmean(x, 3, endrule = 'trim', align = 'right'),
rollmean(x, 3, align = 'right'))
#Unit: microseconds
# expr min lq median uq max neval
# runmean(x, 3, endrule = "trim", align = "right") 631.061 740.0775 847.5915 1020.048 1652.109 100
# rollmean(x, 3, align = "right") 7308.947 9155.7155 10627.0210 12760.439 16919.092 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With