I am working with multidimensional array both on R and MATLAB, these arrays have five dimensions (total of 14.5M of elements). I have to remove a dimension applying an arithmetic mean on it and I discovered an amazing difference of performances using the two softwares.
MATLAB:
>> a = rand([144 73 10 6 23]);
>> tic; b = mean(a,3); toc
Elapsed time is 0.014454 seconds.
R:
> a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
> start <- Sys.time (); b = apply(a, c(1,2,4,5), mean); Sys.time () - start
Time difference of 1.229083 mins
I know that apply function is slow because is something like a general purpose function, but I don't know how to deal with this problem because this difference of performances is really a big limit for me. I tried to search for a generalization of colMeans/rowMeans functions but I didn't succeed.
EDIT I'll show a little sample matrix:
> dim(a)
[1] 2 4 3
> dput(aa)
structure(c(7, 8, 5, 8, 10, 11, 9, 9, 6, 12, 9, 10, 12, 10, 14,
12, 7, 9, 8, 10, 10, 9, 8, 6), .Dim = c(2L, 4L, 3L))
a_mean = apply(a, c(2,3), mean)
> a_mean
[,1] [,2] [,3]
[1,] 7.5 9.0 8.0
[2,] 6.5 9.5 9.0
[3,] 10.5 11.0 9.5
[4,] 9.0 13.0 7.0
EDIT (2):
I discovered that applying sum function and then dividing by the size of the removed dimension is definitely faster:
> start <- Sys.time (); aaout = apply(aa, c(1,2,4,5), sum); Sys.time () - start
Time difference of 5.528063 secs
In R, apply
is not the right tool for the task. If you had a matrix and needed the row or column means, you would use the much much faster, vectorized rowMeans
and colMeans
. You can still use these for a multi-dimensional array but you need to be a little creative:
Assuming your array has n
dimensions, and you want to compute means along dimension i
:
aperm
to move the dimension i
to the last position n
rowMeans
with dims = n - 1
Similarly, you could:
aperm
to move the dimension i
to the first positioncolMeans
with dims = 1
a <- array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
means.along <- function(a, i) {
n <- length(dim(a))
b <- aperm(a, c(seq_len(n)[-i], i))
rowMeans(b, dims = n - 1)
}
system.time(z1 <- apply(a, c(1,2,4,5), mean))
# user system elapsed
# 25.132 0.109 25.239
system.time(z2 <- means.along(a, 3))
# user system elapsed
# 0.283 0.007 0.289
identical(z1, z2)
# [1] TRUE
mean
is particularly slow because of S3 method dispatch. This is faster:
set.seed(42)
a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
system.time({b = apply(a, c(1,2,4,5), mean.default)})
# user system elapsed
#16.80 0.03 16.94
If you don't need to handle NA
s you can use the internal function:
system.time({b1 = apply(a, c(1,2,4,5), function(x) .Internal(mean(x)))})
# user system elapsed
# 6.80 0.04 6.86
For comparison:
system.time({b2 = apply(a, c(1,2,4,5), function(x) sum(x)/length(x))})
# user system elapsed
# 9.05 0.01 9.08
system.time({b3 = apply(a, c(1,2,4,5), sum)
b3 = b3/dim(a)[[3]]})
# user system elapsed
# 7.44 0.03 7.47
(Note that all timings are only approximate. Proper benchmarking would require running this repreatedly, e.g., using one of the bechmarking packages. But I'm not patient enough for that right now.)
It might be possible to speed this up with an Rcpp implementation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With