
R: What is the optimal way to compute functions over time on 3D arrays (latitude, longitude, and time)?

I work a lot with large 3D arrays (latitude, longitude, and time), with sizes of, for example, 721x1440x480. Usually I need to perform operations over time for each latitude and longitude, for example computing the average (resulting in a 2D array), a rolling mean in time (resulting in a 3D array), or more complex functions.

My question is: which package (or approach) is the most efficient and fastest for this?

I know one option is base R, using the apply function, together with the zoo package, which provides rollapply for rolling functions. Another way is with the tidyverse, another with data.table, and there are combinations of these packages. But is there one that is the fastest?

For example if I have this cube of data:

data <- array(rnorm(721*1440*480),dim = c(721,1440,480))

where the dimensions are latitude, longitude, and time, defined like this:

lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)

And I usually need to do things like this (this is in base R + zoo):

# Average in time
average_data <- apply(data, 1:2, mean)

# Rolling mean, width of window = 3
library(zoo)
rolling_mean <- function(x){
  return(rollapply(data = x, width = 3, by = 1, FUN = mean))
}
rolling_mean_data <- apply(X = data, MARGIN = 1:2,
                           FUN = rolling_mean)
# apply() puts the rolled time axis first, so permute back to lat x lon x time
rolling_mean_data <- aperm(a = rolling_mean_data, perm = c(2,3,1))
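
With the tidyverse, I imagine the time average would look roughly like this (just a sketch on my part: it melts the array into a long data frame first, which is very memory-hungry at the full 721x1440x480 size):

library(dplyr)

# give the array its coordinates so the melted columns carry the real values
dimnames(data) <- list(lat = as.character(lat), lon = as.character(lon),
                       time = as.character(time))

# melt to a long data frame with columns lat, lon, time, value
df <- as.data.frame.table(data, responseName = "value", stringsAsFactors = FALSE)

# average over time for each latitude/longitude cell
average_df <- df %>%
  group_by(lat, lon) %>%
  summarise(mean_value = mean(value), .groups = "drop")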

The function is not always the mean; it may also be another statistic, such as the standard deviation or the correlation with a time series.

So, what is the fastest way to do this type of calculation?

asked Oct 07 '20 by Santiago I. Hurtado

2 Answers

Usually data.table is quite fast and memory efficient. You might want to check ?data.table::frollapply and note that as.data.table provides an S3 method for class 'array'.

The following compares the base R code snippets you provided with their data.table counterparts (I'm not really familiar with the tidyverse). I'm focusing only on the calculation itself, not on the output format, so for the data.table solution the output isn't converted back to an array (that back-conversion would again involve aperm, as in your code above).

The benchmark was created using the reduced dataset (I'll update with the results for your full dataset once the computation is done).

Edit: the benchmark results below have now been updated using the entire dataset.

library(data.table)
library(microbenchmark)
library(zoo)

lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)

# data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480), dimnames = list(lat = lat, lon = lon, time = time))
data <- array(rnorm(72 * 144 * 48), dim = c(72, 144, 48), dimnames = list(lat = NULL, lon  = NULL, time  = NULL)) # reduced dataset
DT <- as.data.table(data)
setorder(DT, time) # as per @jangorecki we need to ensure that DT is sorted by time

rolling_mean <- function(x) {
  return(rollapply(
    data = x,
    width = 3,
    by = 1,
    FUN = mean
  ))
}

microbenchmark(
  "mean - base R" = {
    apply(data, 1:2, mean)
  },
  "mean - data.table" = {
    DT[, .("mean_value" = mean(value)), by = .(lat, lon)]
  },
  times = 1L
)

Unit: seconds
              expr       min        lq      mean    median        uq       max neval
     mean - base R 33.152048 33.152048 33.152048 33.152048 33.152048 33.152048     1
 mean - data.table  8.287379  8.287379  8.287379  8.287379  8.287379  8.287379     1


microbenchmark(
  "rollmean - base R" = {
      apply(X = data,
            MARGIN = 1:2,
            FUN = rolling_mean)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  times = 1L
)

Unit: seconds
                  expr       min        lq      mean    median        uq       max neval
     rollmean - base R 477.26644 477.26644 477.26644 477.26644 477.26644 477.26644     1
 rollmean - data.table  48.80027  48.80027  48.80027  48.80027  48.80027  48.80027     1

As you can see, data.table provides the results significantly faster.
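
If you do need the result back as a lat x lon x time array, one possible way (a sketch of mine, not benchmarked, assuming DT covers the full regular grid and has the columns lat, lon, time and rollmean) is:

n_lat  <- uniqueN(DT$lat)
n_lon  <- uniqueN(DT$lon)
n_time <- uniqueN(DT$time)

setorder(DT, lat, lon, time)                    # time varies fastest, then lon, then lat
tmp <- array(DT$rollmean, dim = c(n_time, n_lon, n_lat))
rollmean_array <- aperm(tmp, perm = c(3, 2, 1)) # back to lat x lon x time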

answered Nov 10 '22 by ismirsehregal

For efficient means over the array, we can transpose the array and use colMeans(). This is faster than (my favorite) data.table and does not require coercion from an array to a data.frame-like object. Thanks to @ismirsehregal for the structure of the benchmarks.

Note, this is the smaller dataset. While there is a conversion-time penalty for going from array to data.table, the data.table method would likely become faster at some point, as it can make use of parallel cores unlike the colMeans approach. I didn't have enough RAM to use the larger dataset.

colMeans(aperm(data, c(3L, 1L, 2L)))

all.equal(apply(data, 1:2, mean), colMeans(aperm(data, c(3L, 1L, 2L))))
## [1] TRUE

microbenchmark::microbenchmark(
  "mean - base R" = apply(data, 1:2, mean),
  "mean - data.table" = DT[, .("mean_value" = mean(value)), by = .(lat, lon)],
  "mean - base colMeans" = colMeans(aperm(data, c(3L, 1L, 2L)))
)

## Unit: milliseconds
##                  expr     min       lq      mean  median      uq      max neval
##         mean - base R 48.6293 52.30380 57.611552 54.8006 61.2880 146.4658   100
##     mean - data.table  7.5610  8.95255 13.004569  9.7459 17.4888  28.7324   100
##  mean - base colMeans  6.1713  6.45505  8.421273  6.6429  7.7849  99.3952   100
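
As a side note on the parallelism point above, the number of threads data.table uses can be checked and changed; a quick sketch (not part of the benchmark):

getDTthreads()   # threads data.table currently uses (e.g. for grouping and frollmean)
setDTthreads(4L) # set the thread count before re-running the benchmarks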

For efficient rolling means, zoo::rollapply() is intuitive but not fast. We can use a number of package solutions such as data.table::frollmean() or RcppRoll::roll_mean(). See also this question for various methods of computing rolling averages:

Calculating moving average

Or @JanGorecki's question on performance:

Adaptive moving average - top performance in R
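
RcppRoll::roll_mean() (mentioned above, but not included in the benchmark below) would slot into the same apply() pattern; a minimal sketch, assuming the package is installed:

library(RcppRoll)

# width-3 rolling mean along the time dimension of every lat/lon cell,
# mirroring the zoo-based version from the question
rolling_mean_rcpp <- apply(X = data, MARGIN = 1:2, FUN = roll_mean, n = 3L)
rolling_mean_rcpp <- aperm(rolling_mean_rcpp, c(2, 3, 1))  # back to lat x lon x time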

I profiled @Matti Pastell's solution and @pipefish's solution from the first link in addition to using data.table::frollmean() with apply:

##@Matti Pastell's
base_ma = function(x, n = 5){stats::filter(x, rep(1 / n, n), sides = 2)}
##@pipefish's
base_ma2 = function(x, n = 5L) {
  cx <- c(0,cumsum(x))
  (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
}
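
A quick sanity check (mine, not part of the original benchmark) that the cumsum version returns the same values as the zoo rolling mean on a small vector:

x <- rnorm(10)
all.equal(as.numeric(zoo::rollapply(x, width = 3, FUN = mean)),
          base_ma2(x, n = 3L))
# should be TRUE: both compute the same width-3 means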

microbenchmark(
  "rollmean - apply zoo" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = rolling_mean)
  },
  "rollmean - apply dt.frollmean" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = frollmean, n = 3L)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  "rollmean - apply stats::filter" = {
    apply(X = data,
        MARGIN = 1:2,
        FUN = base_ma, n = 3L)
  },
  "rollmean - apply w_cumsum" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = base_ma2, n = 3L)
  },
  times = 1L
)
Unit: milliseconds
                           expr       min        lq      mean    median        uq       max neval
           rollmean - apply zoo 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218     1
  rollmean - apply dt.frollmean  331.7787  331.7787  331.7787  331.7787  331.7787  331.7787     1
          rollmean - data.table  358.0787  358.0787  358.0787  358.0787  358.0787  358.0787     1
 rollmean - apply stats::filter  435.5238  435.5238  435.5238  435.5238  435.5238  435.5238     1
      rollmean - apply w_cumsum  148.7428  148.7428  148.7428  148.7428  148.7428  148.7428     1

answered Nov 09 '22 by Cole