I work a lot with large 3D arrays (latitude, longitude, and time), with sizes of, for example, 721x1440x480. Usually I need to perform operations over time for each latitude and longitude, for example computing the average (resulting in a 2D array), a rolling mean in time (resulting in a 3D array), or more complex functions.
My question is: which package (or approach) is the most efficient and fastest for this?
I know one option is base R, using the apply function, combined with the zoo package, which provides the rollapply function for rolling calculations. Another way is with tidyverse and another way is with data.table, or combinations of these packages. But is there one that is the fastest?
For example if I have this cube of data:
data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480))
whose dimensions are latitude, longitude, and time:
lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)
And I usually need to do things like this (this is in base R + zoo):
# Average in time
average_data <- apply(data, 1:2, mean)
# Rolling mean, width of window = 3
library(zoo)
rolling_mean <- function(x){
  return(rollapply(data = x, width = 3, by = 1, FUN = mean))
}
rolling_mean_data <- apply(X = data, MARGIN = 1:2,
                           FUN = rolling_mean)
rolling_mean_data <- aperm(a = rolling_mean_data, perm = c(2, 3, 1))
The function is not always the mean; it could also be another statistic, such as the standard deviation or the correlation with a time series.
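For instance, a standard deviation in time or the correlation of each grid cell with a reference series follows the same pattern (just an illustration; ref_series is a made-up vector of length 480):
# Standard deviation in time for each grid cell (2D result)
sd_data <- apply(data, 1:2, sd)
# Correlation of each grid cell's time series with a reference series
ref_series <- rnorm(480)
cor_data <- apply(data, 1:2, function(x) cor(x, ref_series))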
So, which is the fastest way to do this type of calculation?
Usually data.table is quite fast and memory efficient. You might want to check ?data.table::frollapply and note that as.data.table provides an S3 method for class 'array'.
The following compares the base R code snippets you provided with their data.table counterparts (I'm not really familiar with tidyverse). I'm focusing only on the calculation itself, not on the output format, so for the data.table solution the output isn't converted back to an array (see the reshaping sketch after the first benchmark below). Here is some info on that - also using aperm as you do above.
The benchmark was created using the reduced dataset (I'll update with the results for your full dataset once the computation is done).
Edit: Just updated the benchmark results using the entire dataset.
library(data.table)
library(microbenchmark)
library(zoo)
lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)
# data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480), dimnames = list(lat = lat, lon = lon, time = time))
data <- array(rnorm(72 * 144 * 48), dim = c(72, 144, 48), dimnames = list(lat = NULL, lon = NULL, time = NULL)) # reduced dataset
DT <- as.data.table(data)
setorder(DT, time) # as per @jangorecki we need to ensure that DT is sorted by time
rolling_mean <- function(x) {
return(rollapply(
data = x,
width = 3,
by = 1,
FUN = mean
))
}
microbenchmark(
"mean - base R" = {
apply(data, 1:2, mean)
},
"mean - data.table" = {
DT[, .("mean_value" = mean(value)), by = .(lat, lon)]
},
times = 1L
)
Unit: seconds
expr min lq mean median uq max neval
mean - base R 33.152048 33.152048 33.152048 33.152048 33.152048 33.152048 1
mean - data.table 8.287379 8.287379 8.287379 8.287379 8.287379 8.287379 1
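If the usual lat x lon matrix layout is needed, the aggregated result can be reshaped afterwards, e.g. with dcast() (a sketch, not included in the timings above; as.matrix(..., rownames = ) assumes a reasonably recent data.table version):
# aggregate to one mean per grid cell (long format: lat, lon, mean_value)
res <- DT[, .(mean_value = mean(value)), by = .(lat, lon)]
# one row per latitude, one column per longitude
wide <- dcast(res, lat ~ lon, value.var = "mean_value")
average_matrix <- as.matrix(wide, rownames = "lat")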
microbenchmark(
"rollmean - base R" = {
apply(X = data,
MARGIN = 1:2,
FUN = rolling_mean)
},
"rollmean - data.table" = {
DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
},
times = 1L
)
Unit: seconds
expr min lq mean median uq max neval
rollmean - base R 477.26644 477.26644 477.26644 477.26644 477.26644 477.26644 1
rollmean - data.table 48.80027 48.80027 48.80027 48.80027 48.80027 48.80027 1
As you can see, using data.table provides the results significantly faster.
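The frollapply() function mentioned above follows the same grouped pattern for statistics other than the mean, e.g. a rolling standard deviation (a sketch, not benchmarked here):
# rolling standard deviation per grid cell via data.table::frollapply
DT[, roll_sd := frollapply(value, n = 3, FUN = sd), by = .(lat, lon)]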
For efficient means over the array, we can transpose the array and use colMeans(). This is faster than (my favorite) data.table and does not require coercion from an array to a data.frame-like object. Thanks to @ismirsehregal for the structure of the benchmarks.
Note, this uses the smaller dataset. While there is a conversion time penalty for going from array to data.table, the data.table method would likely become faster at some point, as it can make use of parallel cores, unlike the colMeans approach. I didn't have enough RAM to use the larger dataset.
colMeans(aperm(data, c(3L, 1L, 2L)))
all.equal(apply(data, 1:2, mean), colMeans(aperm(data, c(3L, 1L, 2L))))
## [1] TRUE
microbenchmark::microbenchmark(
"mean - base R" = apply(data, 1:2, mean),
"mean - data.table" = DT[, .("mean_value" = mean(value)), by = .(lat, lon)],
"mean - base colMeans" = colMeans(aperm(data, c(3L, 1L, 2L)))
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## mean - base R 48.6293 52.30380 57.611552 54.8006 61.2880 146.4658 100
## mean - data.table 7.5610 8.95255 13.004569 9.7459 17.4888 28.7324 100
## mean - base colMeans 6.1713 6.45505 8.421273 6.6429 7.7849 99.3952 100
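As a side note, base R can compute the same means without the aperm() transpose by using the dims argument of rowMeans() (not timed in the benchmark above):
# means over the 3rd (time) dimension; dims = 2 keeps the first two (lat, lon) dimensions
all.equal(apply(data, 1:2, mean), rowMeans(data, dims = 2))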
For efficient rolling means, zoo::rollapplyr() is intuitive but not fast. We can use a number of package solutions such as data.table::frollmean() or RcppRoll::roll_mean(). See also here for various methods for rolling averages:
Calculating moving average
Or @JanGorecki's question on performance:
Adaptive moving average - top performance in R
I profiled @Matti Pastell's solution and @pipefish's solution from the first link, in addition to using data.table::frollmean() with apply:
##@Matti Pastell's
base_ma = function(x, n = 5){stats::filter(x, rep(1 / n, n), sides = 2)}
##@pipefish's
base_ma2 = function(x, n = 5L) {
cx <- c(0,cumsum(x))
(cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
}
microbenchmark(
"rollmean - apply zoo" = {
apply(X = data,
MARGIN = 1:2,
FUN = rolling_mean)
},
"rollmean - apply dt.frollmean" = {
apply(X = data,
MARGIN = 1:2,
FUN = frollmean, n = 3L)
},
"rollmean - data.table" = {
DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
},
"rollmean - apply stats::filter" = {
apply(X = data,
MARGIN = 1:2,
FUN = base_ma, n = 3L)
},
"rollmean - apply w_cumsum" = {
apply(X = data,
MARGIN = 1:2,
FUN = base_ma2, n = 3L)
},
times = 1L
)
Unit: milliseconds
expr min lq mean median uq max neval
rollmean - apply zoo 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 1
rollmean - apply dt.frollmean 331.7787 331.7787 331.7787 331.7787 331.7787 331.7787 1
rollmean - data.table 358.0787 358.0787 358.0787 358.0787 358.0787 358.0787 1
rollmean - apply stats::filter 435.5238 435.5238 435.5238 435.5238 435.5238 435.5238 1
rollmean - apply w_cumsum 148.7428 148.7428 148.7428 148.7428 148.7428 148.7428 1
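The RcppRoll::roll_mean() function mentioned above could be plugged into the same apply() pattern; a minimal sketch (not part of the benchmark above, assuming the RcppRoll package is installed):
# rolling mean via RcppRoll, analogous to the frollmean variant above
library(RcppRoll)
rolling_mean_rcpp <- apply(X = data, MARGIN = 1:2, FUN = roll_mean, n = 3L)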