Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently center a large matrix in R

I have a large matrix that I would like to center:

X <- matrix(sample(1:10, 5e+08, replace=TRUE), ncol=10000)

Finding the the means is quick and efficient with colMeans:

means <- colMeans(X)

But what's a good (fast and memory efficient) way to subtract the respective mean from each column? This works, but it doesn't feel right:

for (i in 1:length(means)){
  X[,i] <- X[,i]-means[i] 
}

Is there a better way?

/edit: Here's a modification the the various benchmarks DWin wrote, on a larger matrix, including the other posted suggestions:

require(rbenchmark)
X <- matrix(sample(1:10, 5e+07, replace=TRUE), ncol=10000)
frlp.c <- compiler:::cmpfun(function(mat){
  means <- colMeans(mat)
  for (i in 1:length(means)){
    mat[,i] <- mat[,i]-means[i] 
  }
  return(mat)
})

mat.c <- compiler:::cmpfun(function(mat){
  t(t(X) - colMeans(X))
})

swp.c <- compiler:::cmpfun(function(mat){
  sweep(mat, 2, colMeans(mat), FUN='-')
})

scl.c <- compiler:::cmpfun(function(mat){
  scale(mat, scale=FALSE)
})

matmult.c <- compiler:::cmpfun(function(mat){
  mat-rep(1, nrow(mat)) %*% t(colMeans(mat))
})

benchmark( 
  frlp.c=frlp.c(X),
  mat=mat.c(X),
  swp=swp.c(X),
  scl=scl.c(X), 
  matmult=matmult.c(X),
  replications=10,
  order=c('replications', 'elapsed'))

The matmult function seems to be new winner! I really want to try these out on a 5e+08 element matrix, but I keep running out of RAM.

     test replications elapsed relative user.self sys.self user.child sys.child
5 matmult           10   11.98    1.000      7.47     4.47         NA        NA
1  frlp.c           10   35.05    2.926     31.66     3.32         NA        NA
2     mat           10   50.56    4.220     44.52     5.67         NA        NA
4     scl           10   58.86    4.913     50.26     8.42         NA        NA
3     swp           10   61.25    5.113     51.98     8.64         NA        NA
like image 643
Zach Avatar asked Sep 08 '12 16:09

Zach


3 Answers

This seems to be about twice as fast as sweep().

X - rep(1, nrow(X)) %*% t(colMeans(X))

X <- matrix(sample(1:10, 5e+06, replace=TRUE), ncol=10000)
system.time(sweep(X, 2, colMeans(X)))
   user  system elapsed 
   0.33    0.00    0.33 
system.time(X - rep(1, nrow(X)) %*% t(colMeans(X)))
   user  system elapsed 
   0.15    0.03    0.19 

DWin edit: When I did this with a smaller matrix than the OP used (only 5e+07) I get these timings, where Josh's is mat2 (The larger one overflowed into virtual memory on my Mac w/ 32GB and needed to be terminated) :

  test replications elapsed relative user.self sys.self user.child sys.child
2 mat2            1   0.546 1.000000     0.287    0.262          0         0
3  mat            1   2.372 4.344322     1.569    0.812          0         0
1 frlp            1   2.520 4.615385     1.720    0.809          0         0
4  swp            1   2.990 5.476190     1.959    1.043          0         0
5  scl            1   3.019 5.529304     1.984    1.046          0         0
like image 137
Josh O'Brien Avatar answered Nov 19 '22 21:11

Josh O'Brien


Could this be useful for you?

sweep(X, 2, colMeans(X)) # this substracts the colMean to each col
scale(X, center=TRUE, scale=FALSE) # the same

sweep(X, 2, colMeans(X), FUN='/') # this makes division

If you want to speed up your code based on the for loop you can use cmpfun from compiler package. Example

X <- matrix(sample(1:10, 500000, replace=TRUE), ncol=100) # some data
means <- colMeans(X) # col means

library(compiler)

# One of your functions to be compiled and tested
Mean <- function(x) {
  for (i in 1:length(means)){
      X[,i] <- X[,i]-means[i] 
  }
  return(X)
}



CMean <- cmpfun(Mean) # compiling the Mean function

system.time(Mean(X))
   user  system elapsed 
  0.028   0.016   0.101 
system.time(CMean(X))
   user  system elapsed 
  0.028   0.012   0.066 

Maybe this suggestion could help you.

like image 26
Jilber Urbina Avatar answered Nov 19 '22 20:11

Jilber Urbina


I can see why Jilber was uncertain regarding what you wanted, since at one point you ask for division but in your code you use subtraction. The sweep operation he suggests is superfluous here. Just using scale would do it:

 cX <- scale(X, scale=FALSE) # does the centering with subtraction of col-means
 sX <- scale(X, center=FALSE) # does the scaling operation
 csX <- scale(X) # does both

(It's hard to believe that scale is slower. Look at it's code. Uses sweep on columns)

 scale.default # since it's visible.

A matrix approach:

t( t(X) / colMeans(X) )

Edit: Some timings (I was wrong about scale being equivalent to sweep-colMeans):

require(rbenchmark)
benchmark(
    mat={sX <- t( t(X) / colMeans(X) ) },
    swp ={swX <- sweep(X, 2, colMeans(X), FUN='/')},
    scl={sX <- scale(X, center=FALSE)}, 
    replications=10^2,
    order=c('replications', 'elapsed'))
#-----------
  test replications elapsed relative user.self sys.self user.child sys.child
1  mat          100   0.015 1.000000     0.015        0          0         0
2  swp          100   0.015 1.000000     0.015        0          0         0
3  scl          100   0.025 1.666667     0.025        0          0         0

Some funny things happen when you scal this up. The above timigns were mad with samall matrix-X. Below is with something closer to what you were using:

     benchmark( 
        frlp ={means <- colMeans(X)
                       for (i in 1:length(means)){
                              X[,i] <- X[,i]-means[i] 
                                }
                      },
         mat={sX <- t( t(X) - colMeans(X) )    },
         swp ={swX <- sweep(X, 2, colMeans(X), FUN='-')},
         scl={sX <- scale(X, scale=FALSE)}, 
     replications=10^2,
     order=c('replications', 'elapsed'))
#    
  test replications elapsed relative user.self sys.self user.child sys.child
2  mat          100   2.075 1.000000     1.262    0.820          0         0
3  swp          100   2.964 1.428434     1.917    1.058          0         0
4  scl          100   2.981 1.436627     1.935    1.059          0         0
1 frlp          100   3.651 1.759518     2.540    1.128          0         0
like image 3
IRTFM Avatar answered Nov 19 '22 21:11

IRTFM