
Efficient apply or mapply for multiple matrix arguments by row

Tags: r

I have two matrices that I want to apply a function to, by rows:

matrixA
           GSM83009  GSM83037  GSM83002  GSM83029  GSM83041
100001_at  5.873321  5.416164  3.512227  6.064150  3.713696
100005_at  5.807870  6.810829  6.105804  6.644000  6.142413
100006_at  2.757023  4.144046  1.622930  1.831877  3.694880

matrixB
          GSM82939 GSM82940 GSM82974 GSM82975
100001_at 3.673556 2.372952 3.228049 3.555816
100005_at 6.916954 6.909533 6.928252 7.003377
100006_at 4.277985 4.856986 3.670161 4.075533

I've found several similar questions, but not many answers: "mapply for matrices" and "Multi matrix row-wise mapply?". The code I have now splits the matrices by row into lists, but the splitting makes it rather slow, not much faster than a for loop, given that each matrix has almost 9000 rows:

scores <- mapply(t.test.stat, split(matrixA, row(matrixA)), split(matrixB, row(matrixB)))

The function itself is very simple, just finding the t-value:

t.test.stat <- function(x, y)
{
    return( (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y)) )
}
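
As a quick sanity check, this statistic is Welch's unequal-variance t, which is what stats::t.test reports by default (var.equal = FALSE). A minimal sketch, using the first rows of the two matrices above as example data:

```r
t.test.stat <- function(x, y) {
    (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y))
}

# First rows (100001_at) of matrixA and matrixB from the question
x <- c(5.873321, 5.416164, 3.512227, 6.064150, 3.713696)
y <- c(3.673556, 2.372952, 3.228049, 3.555816)

manual  <- t.test.stat(x, y)
builtin <- unname(t.test(x, y)$statistic)  # Welch test is the default

# Compare the hand-written statistic with the one from t.test
all.equal(manual, builtin)
```

If only the statistic is needed, the hand-written function avoids the extra work t.test does for degrees of freedom, p-values and confidence intervals.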

Edd asked Apr 11 '11 19:04




1 Answer

Splitting the matrices isn't the biggest contributor to evaluation time.

set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)

# Timing with the split done inside the call
system.time( scores <- mapply(t.test.stat,
    split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
#    user  system elapsed 
#    1.57    0.00    1.58 

# Timing with the split done beforehand, so only mapply is measured
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
#    user  system elapsed 
#    1.14    0.00    1.14 

Look at the output from Rprof to see that most of the time is, not surprisingly, spent evaluating t.test.stat (mean, var, etc.). Basically, calling those functions once per row adds quite a bit of overhead.

Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()

You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.

Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):

system.time({
ncA <- NCOL(matrixA)
ncB <- NCOL(matrixB)
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
  sqrt( rowMeans((matrixA-rowMeans(matrixA))^2)*(ncA/(ncA-1))/ncA +
        rowMeans((matrixB-rowMeans(matrixB))^2)*(ncB/(ncB-1))/ncB )
})
#    user  system elapsed 
#      0       0       0 
head(ans)
# [1]  0.8272511 -1.0965269  0.9862844 -0.6026452 -0.2477661  1.1896181
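
To confirm that the vectorized expression computes the same values as the mapply version, the two results can be compared directly. A small self-contained sketch; it redefines t.test.stat and uses 200 rows (a size chosen here only to keep it quick), with A and B standing in for matrixA and matrixB:

```r
t.test.stat <- function(x, y) {
    (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y))
}

set.seed(21)
A <- matrix(rnorm(5 * 200), nrow = 200)
B <- matrix(rnorm(4 * 200), nrow = 200)

# Row-by-row reference result, as in the question
ref <- mapply(t.test.stat, split(A, row(A)), split(B, row(B)))

# Vectorized version, same formula as above
ncA <- NCOL(A)
ncB <- NCOL(B)
vec <- (rowMeans(A) - rowMeans(B)) /
  sqrt( rowMeans((A - rowMeans(A))^2) * (ncA/(ncA - 1))/ncA +
        rowMeans((B - rowMeans(B))^2) * (ncB/(ncB - 1))/ncB )

# mapply keeps the names from split(), so drop them before comparing
all.equal(unname(ref), vec)
```

The subtraction A - rowMeans(A) works because R stores matrices column by column and recycles the vector of row means down each column, so every element has its own row's mean subtracted.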

UPDATE
Here's a "cleaner" version using a rowVars function:

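# Row-wise sample variance. The n/(n-1) correction uses NCOL(x), so it
# assumes x is a matrix with no missing values in its rows.
rowVars <- function(x, na.rm=FALSE, dims=1L) {
  rowMeans((x-rowMeans(x, na.rm, dims))^2, na.rm, dims)*(NCOL(x)/(NCOL(x)-1))
}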
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
  sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
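
When the per-row function cannot be rewritten in terms of rowMeans, one general alternative is to loop over row indices instead of splitting each matrix into a list. This is only a sketch: it still calls the function once per row, so it will not approach the vectorized version, and the helper name row_pairs_apply is made up for illustration.

```r
t.test.stat <- function(x, y) {
    (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y))
}

# Hypothetical helper: apply f to matching rows of two matrices.
# vapply checks that every call returns a single numeric value.
row_pairs_apply <- function(f, a, b) {
  stopifnot(nrow(a) == nrow(b))
  vapply(seq_len(nrow(a)), function(i) f(a[i, ], b[i, ]), numeric(1))
}

set.seed(21)
A <- matrix(rnorm(5 * 50), nrow = 50)
B <- matrix(rnorm(4 * 50), nrow = 50)

scores <- row_pairs_apply(t.test.stat, A, B)
head(scores)
```

Indexing rows directly avoids building two lists of 9000 vectors up front, though the cost of the per-row calls to mean and var remains.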

Joshua Ulrich answered Nov 11 '22 21:11