Arithmetic mean over a multidimensional array in R and MATLAB: drastic performance difference

I am working with multidimensional arrays in both R and MATLAB; these arrays have five dimensions (about 14.5 million elements in total). I have to collapse one dimension by taking the arithmetic mean along it, and I discovered a striking performance difference between the two programs.

MATLAB:

>> a = rand([144  73  10   6  23]);
>> tic; b = mean(a,3); toc
Elapsed time is 0.014454 seconds.

R:

> a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
> start <- Sys.time (); b = apply(a, c(1,2,4,5), mean); Sys.time () - start
Time difference of 1.229083 mins

I know that apply is slow because it is a general-purpose function, but I don't know how to deal with this problem: this performance gap is a real limitation for me. I searched for a generalization of the colMeans/rowMeans functions, but without success.

EDIT: here is a small sample array:

> dim(aa)
[1] 2 4 3
> dput(aa)
structure(c(7, 8, 5, 8, 10, 11, 9, 9, 6, 12, 9, 10, 12, 10, 14, 
12, 7, 9, 8, 10, 10, 9, 8, 6), .Dim = c(2L, 4L, 3L))
> a_mean = apply(aa, c(2,3), mean)
> a_mean
     [,1] [,2] [,3]
[1,]  7.5  9.0  8.0
[2,]  6.5  9.5  9.0
[3,] 10.5 11.0  9.5
[4,]  9.0 13.0  7.0
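
For this small example, the same result should also be reachable without apply via colMeans (the answers below explain why); a one-line sketch:

> colMeans(aa, dims = 1)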

EDIT (2):

I discovered that applying sum and then dividing by the size of the removed dimension is considerably faster:

> start <- Sys.time (); aaout = apply(a, c(1,2,4,5), sum); Sys.time () - start
Time difference of 5.528063 secs
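
The means themselves can then be recovered by dividing by the size of the collapsed dimension; a one-line sketch of that final step (variable names as above):

> b = aaout / dim(a)[3]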
asked Sep 05 '13 by Matteo De Felice

2 Answers

In R, apply is not the right tool for the task. If you had a matrix and needed the row or column means, you would use the much faster, vectorized rowMeans and colMeans. You can still use these on a multi-dimensional array, but you need to be a little creative:

Assuming your array has n dimensions, and you want to compute means along dimension i:

  1. use aperm to move the dimension i to the last position n
  2. use rowMeans with dims = n - 1

Similarly, you could (a sketch of this variant appears after the timings below):

  1. use aperm to move the dimension i to the first position
  2. use colMeans with dims = 1

a <- array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))

means.along <- function(a, i) {
  n <- length(dim(a))
  b <- aperm(a, c(seq_len(n)[-i], i))
  rowMeans(b, dims = n - 1)
}

system.time(z1 <- apply(a, c(1,2,4,5), mean))
#    user  system elapsed 
#  25.132   0.109  25.239 
system.time(z2 <- means.along(a, 3))
#    user  system elapsed 
#   0.283   0.007   0.289 

identical(z1, z2)
# [1] TRUE
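
A minimal sketch of the colMeans route described above (an illustrative variant, not the answer's own code) would be:

means.along.first <- function(a, i) {
  n <- length(dim(a))
  b <- aperm(a, c(i, seq_len(n)[-i]))  # move dimension i to the front
  colMeans(b, dims = 1)                # average over the (now first) dimension
}

all.equal(z2, means.along.first(a, 3))
# should be TRUE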
answered by flodel

mean is particularly slow because of S3 method dispatch. This is faster:

set.seed(42)
a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))

system.time({b = apply(a, c(1,2,4,5), mean.default)})
# user  system elapsed 
#16.80    0.03   16.94

If you don't need to handle NAs you can use the internal function:

system.time({b1 = apply(a, c(1,2,4,5),  function(x) .Internal(mean(x)))})
# user  system elapsed 
# 6.80    0.04    6.86

For comparison:

system.time({b2 = apply(a, c(1,2,4,5),  function(x) sum(x)/length(x))})
# user  system elapsed 
# 9.05    0.01    9.08 

system.time({b3 = apply(a, c(1,2,4,5),  sum)
             b3 = b3/dim(a)[[3]]})
# user  system elapsed 
# 7.44    0.03    7.47

(Note that all timings are only approximate. Proper benchmarking would require running this repeatedly, e.g., using one of the benchmarking packages. But I'm not patient enough for that right now.)
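
For reference, such a comparison might look roughly like this, assuming the microbenchmark package is installed (sketch only, not run here):

library(microbenchmark)
microbenchmark(
  mean_default = apply(a, c(1,2,4,5), mean.default),
  internal     = apply(a, c(1,2,4,5), function(x) .Internal(mean(x))),
  sum_divide   = apply(a, c(1,2,4,5), sum) / dim(a)[3],
  times = 5
)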

It might be possible to speed this up with an Rcpp implementation.
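
One rough sketch of such an Rcpp approach (untested here, and only one of several possible designs) would exploit R's column-major storage and treat the array as (d1*d2) x d3 x (d4*d5), summing over the middle block:

library(Rcpp)

cppFunction('
NumericVector mean_dim3(NumericVector a, IntegerVector dims) {
  int pre  = dims[0] * dims[1];  // product of dimensions before the collapsed one
  int d3   = dims[2];            // size of the collapsed dimension
  int post = dims[3] * dims[4];  // product of dimensions after the collapsed one
  NumericVector out(pre * post); // zero-initialized accumulator
  for (int k = 0; k < post; ++k)
    for (int j = 0; j < d3; ++j)
      for (int i = 0; i < pre; ++i)
        out[k * pre + i] += a[(k * d3 + j) * pre + i];
  return out / (double) d3;
}')

b4 <- array(mean_dim3(a, dim(a)), dim = dim(a)[-3])  # same shape as the apply result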

answered by Roland