I have a list of large matrices. All these matrices have the same number of rows and I want to "unlist" them and bind all their columns together. Below is a piece of code that I wrote, but I am not sure if this is the best I can achieve in terms of computational efficiency.
# simulate
n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}
> system.time( res <- matrix( as.numeric( unlist(test) ) ,nr,nc*n) )
user system elapsed
0.114 0.006 0.120
To work on a list and call a function on all objects, do.call
is my usual first idea, along with cbind
here to bind by column all objects.
For n=100
(with others answers for sake of completeness):
n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}
require(data.table)
ori <- function() { matrix( as.numeric( unlist(test) ) ,nr,nc*n) }
Tensibai <- function() { do.call(cbind,test) }
BrodieG <- function() { `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) }
nicola <- function() { setattr(unlist(test),"dim",c(nr,nc*n)) }
library(microbenchmark)
microbenchmark(r1 <- ori(),
r2 <- Tensibai(),
r3 <- BrodieG(),
r4 <- nicola(), times=10)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
r1 <- ori() 23.834673 24.287391 39.49451 27.066844 29.737964 93.74249 10 a
r2 <- Tensibai() 17.416232 17.706165 18.18665 17.873083 18.192238 21.29512 10 a
r3 <- BrodieG() 6.009344 6.145045 21.63073 8.690869 10.323845 77.95325 10 a
r4 <- nicola() 5.912984 6.106273 13.52697 6.273904 6.678156 75.40914 10 a
As for the why (in comments), @nicola did give the answer about it, there's less copy than original method.
All methods gives the same result:
> identical(r1,r2,r3,r4)
[1] TRUE
It seems that do.call
beats the other method due to a copy made during the matrix
call. What is interesting is that you can avoid that copy using the data.table::setattr
function which allows to set attributes by reference, avoiding any copy. I omitted also the as.numeric
part, since it is not necessary (unlist(test)
is already numeric
). So:
require(microbenchmark)
require(data.table)
f1<-function() setattr(unlist(test),"dim",c(nr,nc*n))
f2<-function() do.call(cbind,test)
microbenchmark(res <-f1(),res2 <- f2(),times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# res <- f1() 4.088455 4.183504 7.540913 4.44109 4.988605 35.05378 10
#res2 <- f2() 18.325302 18.379328 18.776834 18.66857 19.100681 19.47415 10
identical(res,res2)
#[1] TRUE
I think I have a better one. We can avoid some of the overhead from cbind
since we know these all have the same number of rows and columns. Instead, we use c
knowing that the underlying vector nature of the matrices will allow us to re-wrap them into the correct dimensions:
microbenchmark(
x <- `attr<-`(do.call(c, test), "dim", c(nr, nc * n)),
y <- do.call(cbind, test)
)
# Unit: milliseconds
# expr min lq
# x <- `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) 4.435943 4.699006
# y <- do.call(cbind, test) 19.339477 19.567063
# mean median uq max neval cld
# 12.76214 5.209938 9.095001 379.77856 100 a
# 21.64878 20.000279 24.210848 26.02499 100 b
identical(x, y)
# [1] TRUE
If you have varying number of columns you can probably still do this with some care in computing the total number of columns.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With