Why does lm return values when there is no variance in the predicted value?

Consider the following R code (which, I think, eventually calls some Fortran):

X <- 1:1000
Y <- rep(1,1000)
summary(lm(Y~X))

Why does summary return values at all? Shouldn't this model fail to fit, since there is no variance in Y? More importantly, why is the model's R^2 ~= 0.5?

Edit

I tracked the code from lm to lm.fit and can see this call:

z <- .Fortran("dqrls", qr = x, n = n, p = p, y = y, ny = ny,
   tol = as.double(tol), coefficients = mat.or.vec(p, ny), residuals = y,
   effects = y, rank = integer(1L), pivot = 1L:p, qraux = double(p),
   work = double(2 * p), PACKAGE = "base")

That is where the actual fit seems to happen. Looking at http://svn.r-project.org/R/trunk/src/appl/dqrls.f did not help me understand what is going on, because I do not know Fortran.

asked Feb 11 '12 by russellpierce

2 Answers

Statistically speaking, what should we anticipate (I'd like to say "expect", but that's a very specific term ;-))? The coefficients should be a slope of 0 and an intercept of 1, not a failure to fit. The estimated slope is Cov(X,Y)/Var(X): it is the variance of X, not of Y, that sits in the denominator. Since X has non-zero variance, there is no problem. Since the covariance is 0, the estimated coefficient for X should be 0. So, within machine tolerance, this is the answer you're getting.
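A quick check (a sketch; the exact magnitude of the tiny slope depends on your platform and BLAS, so only its order is hedged here):

```r
# Constant response, varying predictor, as in the question
X <- 1:1000
Y <- rep(1, 1000)

fit <- lm(Y ~ X)
coef(fit)
# The intercept is 1. The slope is not exactly 0 but something on the
# order of 1e-19: zero up to floating-point error in the QR decomposition.
```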

There is no statistical anomaly here. There may be a statistical misunderstanding. There's also the issue of machine tolerance, but a coefficient on the order of 1E-19 is rather negligible, given the scale of the predictor and response values.

Update 1: A quick review of simple linear regression can be found on this Wikipedia page. The key thing to note is that Var(x) is in the denominator and Cov(x,y) in the numerator. In this case the numerator is 0 and the denominator is non-zero, so there is no reason to expect a NaN or NA. One may still ask why the resulting coefficient for x isn't exactly 0, and that has to do with numerical precision in the QR decomposition.
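The textbook formula can be checked directly. A sketch, using the question's data: the closed-form slope here is exactly 0, because every term of the covariance sum is exactly 0 in floating point, whereas lm's QR-based fit only gets near 0.

```r
X <- 1:1000
Y <- rep(1, 1000)

# Simple linear regression slope: Cov(x, y) / Var(x)
slope <- cov(X, Y) / var(X)             # numerator is exactly 0 here
intercept <- mean(Y) - slope * mean(X)  # so the intercept is exactly mean(Y)
c(intercept = intercept, slope = slope)
```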

answered Sep 19 '22 by Iterator


I believe this is simply because the QR decomposition is implemented with floating point arithmetic.

The singular.ok parameter actually refers to the design matrix (i.e., X only), not the response. Try

lm.fit(cbind(X, X), Y)

vs.

lm.fit(cbind(X, X), Y, singular.ok = FALSE)
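A minimal sketch of the difference, assuming the X and Y from the question: with the default singular.ok = TRUE the rank-deficient design matrix is tolerated and the redundant coefficient comes back NA, while singular.ok = FALSE makes the fit fail outright.

```r
X <- 1:1000
Y <- rep(1, 1000)

# Rank-deficient design matrix: two identical columns
fit <- lm.fit(cbind(X, X), Y)
fit$rank          # 1, not 2: the duplicate column is pivoted out
fit$coefficients  # the second coefficient is NA

# With singular.ok = FALSE the same call raises an error
try(lm.fit(cbind(X, X), Y, singular.ok = FALSE))
```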
answered Sep 22 '22 by John Colby