I want to compute ordinary least squares (OLS) estimates in R without using "lm", for several reasons. First, "lm" computes lots of things I don't need (such as the fitted values), and data size is an issue in my case. Second, I want to implement OLS myself in R before doing it in another language (e.g. in C with the GSL).
As you may know, the model is Y = Xb + E, with E ~ N(0, sigma^2). As detailed below, b is a vector with two parameters: the intercept (b0) and another coefficient (b1). In the end, for each linear regression I run, I want the estimate of b1 (effect size), its standard error, the estimate of sigma^2 (residual variance), and R^2 (coefficient of determination).
Here are my data. I have N samples (e.g. individuals, N ~ 100). For each sample, I have Y outputs (response variables, Y ~ 10^3) and X points (explanatory variables, X ~ 10^6). I want to treat the Y outputs separately, i.e. I want to launch Y linear regressions: one for output 1, one for output 2, and so on. Moreover, I want to use the explanatory variables one by one: output 1 is regressed on point 1, then on point 2, ..., and finally on point X. (I hope that's clear!)
Here is my R code to check the speed of "lm" versus computing OLS estimates myself via matrix algebra.
First, I simulate dummy data:
nb.samples <- 10   # N
nb.points <- 1000  # X
x <- matrix(data=replicate(nb.samples, sample(x=0:2, size=nb.points, replace=TRUE)),
            nrow=nb.points, ncol=nb.samples, byrow=FALSE,
            dimnames=list(points=paste("p", seq(1,nb.points), sep=""),
                          samples=paste("s", seq(1,nb.samples), sep="")))
nb.outputs <- 10   # Y
y <- matrix(data=replicate(nb.outputs, rnorm(nb.samples)),
            nrow=nb.samples, ncol=nb.outputs, byrow=TRUE,
            dimnames=list(samples=paste("s", seq(1,nb.samples), sep=""),
                          outputs=paste("out", seq(1,nb.outputs), sep="")))
Here is my own function used just below:
GetResFromCustomLinReg <- function(Y, xi){ # both Y and xi are N-dim vectors
  n <- length(Y)
  X <- cbind(rep(1,n), xi)   # design matrix: intercept column + covariate
  p <- 1                     # nb of explanatory variables, besides the intercept
  r <- p + 1                 # rank of X: nb of independent explanatory variables
  inv.XtX <- solve(t(X) %*% X)           # (X'X)^-1
  beta.hat <- inv.XtX %*% t(X) %*% Y     # OLS estimates
  Y.hat <- X %*% beta.hat                # fitted values
  E.hat <- Y - Y.hat                     # residuals
  E2.hat <- (t(E.hat) %*% E.hat)         # residual sum of squares
  sigma2.hat <- (E2.hat / (n - r))[1,1]  # unbiased estimate of sigma^2
  var.covar.beta.hat <- sigma2.hat * inv.XtX
  se.beta.hat <- sqrt(diag(var.covar.beta.hat))      # standard errors
  Y.bar <- mean(Y)
  R2 <- 1 - (E2.hat) / (t(Y-Y.bar) %*% (Y-Y.bar))    # coefficient of determination
  return(c(beta.hat[2], se.beta.hat[2], sigma2.hat, R2))
}
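To convince myself the function is correct, I compare it with "lm" on a single point/output pair (a quick sanity check: the b1 estimate and its standard error should match the "xi" row of the lm summary):

GetResFromCustomLinReg(y[,1], x[1,])   # b1, se(b1), sigma2.hat, R2
coef(summary(lm(y[,1] ~ x[1,])))       # compare with the "xi" row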
Here is my code using the built-in "lm":
res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)})
Here is my custom OLS code:
res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)})
When I run this example with the values given above, I get:
> system.time( res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)}) )
user system elapsed
2.526 0.000 2.528
> system.time( res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)}) )
user system elapsed
4.561 0.000 4.561
(And, naturally, it gets worse when increasing N, X and Y.)
Of course, "lm" has the nice property of "automatically" fitting each column of the response matrix separately (y~xi), whereas I have to call "apply" Y times (once for each yi~xi). But is this the only reason why my code is slower? Does anyone know how to improve it?
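One idea I have not benchmarked yet: the normal equations extend directly to a response matrix, so beta.hat = (X'X)^-1 X'Y computes all the fits for one point in a single call when Y is the full n x q output matrix. A rough, untested sketch of what I have in mind (crossprod(X) is typically faster than t(X) %*% X; the name FitAllOutputs is just for illustration):

FitAllOutputs <- function(Ymat, xi){ # Ymat: n x q response matrix, xi: n-vector
  n <- nrow(Ymat)
  X <- cbind(rep(1,n), xi)                    # same design matrix for every output
  XtX.inv <- solve(crossprod(X))              # (X'X)^-1, computed once
  beta.hat <- XtX.inv %*% crossprod(X, Ymat)  # 2 x q matrix: one column per output
  E.hat <- Ymat - X %*% beta.hat              # residuals for all outputs at once
  rss <- colSums(E.hat^2)                     # residual sum of squares, per output
  sigma2.hat <- rss / (n - 2)                 # residual variance, per output
  se.b1 <- sqrt(sigma2.hat * XtX.inv[2,2])    # standard error of b1, per output
  tss <- colSums(sweep(Ymat, 2, colMeans(Ymat))^2)  # total sum of squares
  R2 <- 1 - rss / tss
  rbind(b1=beta.hat[2,], se.b1=se.b1, sigma2=sigma2.hat, R2=R2)
}

Calling FitAllOutputs(y, x[1,]) would then replace the inner "apply" over the columns of y, leaving only the outer "apply" over the rows of x.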
(Sorry for such a long question, but I really tried to provide a minimal, yet comprehensive example.)
> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-redhat-linux-gnu (64-bit)
Have a look at the fastLm() function in the RcppArmadillo package on CRAN. There is also a similar fastLm() in RcppGSL, which preceded it, but you probably want the Armadillo-based solution. I have some slides in older presentations (on HPC with R) that show the speed gains.

Also note the hint in the help page about better 'pivoted' approaches than the straight inverse of X'X, which can matter with degenerate model matrices.
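For completeness, a minimal sketch of the matrix interface (assuming fastLmPure(X, y), which takes a pre-built model matrix; check the package documentation for the exact signature and return-value names of the version you install):

library(RcppArmadillo)
X <- cbind(1, x[1,])          # model matrix for point 1: intercept + covariate
fit <- fastLmPure(X, y[,1])   # OLS fit for output 1 via Armadillo
fit$coefficients              # estimates of b0 and b1
fit$stderr                    # their standard errors
fit$df.residual               # residual degrees of freedom

Looping this over the rows of x and the columns of y should keep the per-fit overhead well below that of lm().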