I have a large matrix that I would like to center:
X <- matrix(sample(1:10, 5e+08, replace=TRUE), ncol=10000)
Finding the the means is quick and efficient with colMeans:
means <- colMeans(X)
But what's a good (fast and memory efficient) way to subtract the respective mean from each column? This works, but it doesn't feel right:
for (i in 1:length(means)){
X[,i] <- X[,i]-means[i]
}
Is there a better way?
/edit: Here's a modification the the various benchmarks DWin wrote, on a larger matrix, including the other posted suggestions:
require(rbenchmark)
X <- matrix(sample(1:10, 5e+07, replace=TRUE), ncol=10000)
frlp.c <- compiler:::cmpfun(function(mat){
means <- colMeans(mat)
for (i in 1:length(means)){
mat[,i] <- mat[,i]-means[i]
}
return(mat)
})
mat.c <- compiler:::cmpfun(function(mat){
t(t(X) - colMeans(X))
})
swp.c <- compiler:::cmpfun(function(mat){
sweep(mat, 2, colMeans(mat), FUN='-')
})
scl.c <- compiler:::cmpfun(function(mat){
scale(mat, scale=FALSE)
})
matmult.c <- compiler:::cmpfun(function(mat){
mat-rep(1, nrow(mat)) %*% t(colMeans(mat))
})
benchmark(
frlp.c=frlp.c(X),
mat=mat.c(X),
swp=swp.c(X),
scl=scl.c(X),
matmult=matmult.c(X),
replications=10,
order=c('replications', 'elapsed'))
The matmult function seems to be new winner! I really want to try these out on a 5e+08 element matrix, but I keep running out of RAM.
test replications elapsed relative user.self sys.self user.child sys.child
5 matmult 10 11.98 1.000 7.47 4.47 NA NA
1 frlp.c 10 35.05 2.926 31.66 3.32 NA NA
2 mat 10 50.56 4.220 44.52 5.67 NA NA
4 scl 10 58.86 4.913 50.26 8.42 NA NA
3 swp 10 61.25 5.113 51.98 8.64 NA NA
This seems to be about twice as fast as sweep()
.
X - rep(1, nrow(X)) %*% t(colMeans(X))
X <- matrix(sample(1:10, 5e+06, replace=TRUE), ncol=10000)
system.time(sweep(X, 2, colMeans(X)))
user system elapsed
0.33 0.00 0.33
system.time(X - rep(1, nrow(X)) %*% t(colMeans(X)))
user system elapsed
0.15 0.03 0.19
DWin edit: When I did this with a smaller matrix than the OP used (only 5e+07) I get these timings, where Josh's is mat2 (The larger one overflowed into virtual memory on my Mac w/ 32GB and needed to be terminated) :
test replications elapsed relative user.self sys.self user.child sys.child
2 mat2 1 0.546 1.000000 0.287 0.262 0 0
3 mat 1 2.372 4.344322 1.569 0.812 0 0
1 frlp 1 2.520 4.615385 1.720 0.809 0 0
4 swp 1 2.990 5.476190 1.959 1.043 0 0
5 scl 1 3.019 5.529304 1.984 1.046 0 0
Could this be useful for you?
sweep(X, 2, colMeans(X)) # this substracts the colMean to each col
scale(X, center=TRUE, scale=FALSE) # the same
sweep(X, 2, colMeans(X), FUN='/') # this makes division
If you want to speed up your code based on the for
loop you can use cmpfun
from compiler
package. Example
X <- matrix(sample(1:10, 500000, replace=TRUE), ncol=100) # some data
means <- colMeans(X) # col means
library(compiler)
# One of your functions to be compiled and tested
Mean <- function(x) {
for (i in 1:length(means)){
X[,i] <- X[,i]-means[i]
}
return(X)
}
CMean <- cmpfun(Mean) # compiling the Mean function
system.time(Mean(X))
user system elapsed
0.028 0.016 0.101
system.time(CMean(X))
user system elapsed
0.028 0.012 0.066
Maybe this suggestion could help you.
I can see why Jilber was uncertain regarding what you wanted, since at one point you ask for division but in your code you use subtraction. The sweep operation he suggests is superfluous here. Just using scale would do it:
cX <- scale(X, scale=FALSE) # does the centering with subtraction of col-means
sX <- scale(X, center=FALSE) # does the scaling operation
csX <- scale(X) # does both
(It's hard to believe that scale
is slower. Look at it's code. Uses sweep
on columns)
scale.default # since it's visible.
A matrix approach:
t( t(X) / colMeans(X) )
Edit: Some timings (I was wrong about scale
being equivalent to sweep-colMeans):
require(rbenchmark)
benchmark(
mat={sX <- t( t(X) / colMeans(X) ) },
swp ={swX <- sweep(X, 2, colMeans(X), FUN='/')},
scl={sX <- scale(X, center=FALSE)},
replications=10^2,
order=c('replications', 'elapsed'))
#-----------
test replications elapsed relative user.self sys.self user.child sys.child
1 mat 100 0.015 1.000000 0.015 0 0 0
2 swp 100 0.015 1.000000 0.015 0 0 0
3 scl 100 0.025 1.666667 0.025 0 0 0
Some funny things happen when you scal this up. The above timigns were mad with samall matrix-X. Below is with something closer to what you were using:
benchmark(
frlp ={means <- colMeans(X)
for (i in 1:length(means)){
X[,i] <- X[,i]-means[i]
}
},
mat={sX <- t( t(X) - colMeans(X) ) },
swp ={swX <- sweep(X, 2, colMeans(X), FUN='-')},
scl={sX <- scale(X, scale=FALSE)},
replications=10^2,
order=c('replications', 'elapsed'))
#
test replications elapsed relative user.self sys.self user.child sys.child
2 mat 100 2.075 1.000000 1.262 0.820 0 0
3 swp 100 2.964 1.428434 1.917 1.058 0 0
4 scl 100 2.981 1.436627 1.935 1.059 0 0
1 frlp 100 3.651 1.759518 2.540 1.128 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With