Suppose I have a large matrix:
M <- matrix(rnorm(1e7),nrow=20)
Further suppose that each column represents a sample. Say I would like to apply t.test()
to each column, is there a way to do this that is much faster than using apply()
?
apply(M, 2, t.test)
It took slightly less than 2 minutes to run the analysis on my computer:
> system.time(invisible( apply(M, 2, t.test)))
user system elapsed
113.513 0.663 113.519
If you have a multicore machine there are some gains from using all the cores, for example using mclapply
.
> library(multicore)
> M <- matrix(rnorm(40),nrow=20)
> x1 <- apply(M, 2, t.test)
> x2 <- mclapply(1:dim(M)[2], function(i) t.test(M[,i]))
> all.equal(x1, x2)
[1] "Component 1: Component 9: 1 string mismatch" "Component 2: Component 9: 1 string mismatch"
# str(x1) and str(x2) show that the difference is immaterial
This mini-example shows that things go as we planned. Now scale up:
> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, t.test)))
user system elapsed
101.346 0.626 101.859
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i]))))
user system elapsed
55.049 2.527 43.668
This is using 8 virtual cores. Your mileage may vary. Not a huge gain, but it comes from very little effort.
EDIT
If you only care about the t-statistic itself, extracting the corresponding field ($statistic
) makes things a bit faster, in particular in the multicore case:
> system.time(invisible(apply(M, 2, function(c) t.test(c)$statistic)))
user system elapsed
80.920 0.437 82.109
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i])$statistic)))
user system elapsed
21.246 1.367 24.107
Or even faster, compute the t value directly
my.t.test <- function(c){
n <- sqrt(length(c))
mean(c)*n/sd(c)
}
Then
> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
user system elapsed
21.371 0.247 21.532
> system.time(invisible(mclapply(1:dim(M)[2], function(i) my.t.test(M[,i]))))
user system elapsed
144.161 8.658 6.313
You can do better than this with the colttests
function from the genefilter
package (on Bioconductor).
> library(genefilter)
> M <- matrix(rnorm(40),nrow=20)
> my.t.test <- function(c){
+ n <- sqrt(length(c))
+ mean(c)*n/sd(c)
+ }
> x1 <- apply(M, 2, function(c) my.t.test(c))
> x2 <- colttests(M, gl(1, nrow(M)))[,"statistic"]
> all.equal(x1, x2)
[1] TRUE
> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
user system elapsed
27.386 0.004 27.445
> system.time(invisible(colttests(M, gl(1, nrow(M)))[,"statistic"]))
user system elapsed
0.412 0.000 0.414
Ref: "Computing thousands of test statistics simultaneously in R", SCGN, Vol 18 (1), 2007, http://stat-computing.org/newsletter/issues/scgn-18-1.pdf.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With