Issues with the calculation of coefficient of determination in R

Tags: r, matrix, lm

I was playing around with R's lm() function and I noticed something odd in the calculation of the coefficient of determination R^2. Suppose I have a toy regression problem:

set.seed(0); N <- 12
x_ <- 1:N; y_ <- 2 * x_ + rnorm(N, sd = 2)
lm1_ <- lm(y_ ~ x_); S1 <- summary(lm1_)
lm2_ <- lm(y_ ~ -1 + model.matrix(lm1_)); S2 <- summary(lm2_)

At this point lm1_ and lm2_ should be exactly the same thing: the same dependent variable is used, the same model matrix is used, and we should get the exact same fit. Let's check:

identical( fitted(lm2_),fitted(lm1_))       
[1] TRUE                                    #SURE!
identical( residuals(lm2_), residuals(lm1_))
[1] TRUE                                    #YEAP!
identical( lm1_$df.residual, lm2_$df.residual)
[1] TRUE                                    #SORTED!
identical( as.numeric(model.matrix(lm2_ )), as.numeric(model.matrix(lm1_ ))) 
[1] TRUE                                    #WHY NOT?
identical( S2$r.squared, S1$r.squared)        
[1] FALSE                                   #HUH?

What happened? Let's use the definition from the Wikipedia article on the coefficient of determination (or percentage of variance explained, if you like):

1 - ((sum( residuals(lm1_)^2))/( sum( (y_ -mean(y_))^2)))
[1] 0.8575376
1 - ((sum( residuals(lm2_)^2))/( sum( (y_ -mean(y_))^2)))
[1] 0.8575376
identical(1 - ((sum( residuals(lm1_)^2))/(sum( (y_ -mean(y_))^2))), S1$r.squared)
[1] TRUE
identical(1 - ((sum( residuals(lm2_)^2))/(sum( (y_ -mean(y_))^2))), S1$r.squared)
[1] TRUE
identical(1 - ((sum( residuals(lm2_)^2))/(sum( (y_ -mean(y_))^2))), S2$r.squared)
[1] FALSE
S2$r.squared
[1] 0.9700402 # ???

Apparently the whole trick trips up the function summary.lm(). R^2 is calculated as mss / (rss + mss), where rss is the residual sum of squares, sum(residuals(lmX_)^2), and mss is the model sum of squares of the fit, whose computation depends on whether an intercept is present (copy-pasted from summary.lm()):

mss <- if (attr(z$terms, "intercept")) 
            sum((f - mean(f))^2)
        else sum(f^2)

or in our case:

sum(scale(fitted(lm2_), scale = FALSE)^2) /
    (sum(scale(fitted(lm2_), scale = FALSE)^2) + sum(residuals(lm2_)^2))
[1] 0.8575376
sum(fitted(lm2_)^2) / (sum(fitted(lm2_)^2) + sum(residuals(lm2_)^2))
[1] 0.9700402

So clearly R did not recognize the second model as having an intercept. That is a significant problem if, as in my case, someone is moving model matrices around. I know the obvious fix would be to pass the model matrix without its intercept column, but that does not change the fact that this design choice is a bit prone to break. Wouldn't the more general definition save us this trouble with only a minor tweak? Is there some other obvious fix I can use so I don't bump into these problems?
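For what it is worth, the flag that summary.lm() consults can be inspected directly: it lives as the "intercept" attribute on the terms object of each fit, and the two models disagree:

attr(terms(lm1_), "intercept")
[1] 1
attr(terms(lm2_), "intercept")
[1] 0                                       # so summary.lm() picks the uncentered mss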

This problem gets much worse if one has no intercept and uses factor variables with interactions: in practice R may treat one of the factor levels as an intercept, and then the whole situation is just tragic:

library(MASS)
lm_oats <- lm(Y ~ -1 + V * N, data = oats)
S_oats <- summary(lm_oats)
S_oats$r.squared
[1] 0.9640413 # :*D
1-  ( sum( residuals(lm_oats)^2)  / sum( ( oats$Y- mean(oats$Y))^2 )  )
[1] 0.4256653 # :*(

#For the record:
sum(fitted(lm_oats)^2) / (sum(residuals(lm_oats)^2) + sum(fitted(lm_oats)^2))
[1] 0.9640413
sum(scale(fitted(lm_oats), scale = FALSE)^2) /
    (sum(residuals(lm_oats)^2) + sum(scale(fitted(lm_oats), scale = FALSE)^2))
[1] 0.4256653
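For now I circumvent it with a small helper of my own (r2_centered is my name, nothing from base R): it always uses the centered total sum of squares and reconstructs the response from the fit itself, so it works no matter how the model matrix was passed in:

r2_centered <- function(fit) {
    y <- fitted(fit) + residuals(fit)   # reconstruct the response
    1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}
r2_centered(lm2_)
[1] 0.8575376
r2_centered(lm_oats)
[1] 0.4256653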

Is this a design trait of R that can somehow be circumvented? (I appreciate that my question is quite open ended; thanks to all those who made it all the way down here!)

I am using R version 3.0.1, Matrix_1.0-12, lattice_0.20-15, MASS_7.3-26. (Linux 32-bit)

asked Nov 20 '25 by usεr11852

1 Answer

There have been long discussions on the R mailing list about the non-sensibility/different meaning of R^2 for models that are forced through the origin. R tries to detect when you are fitting a model without an intercept ("semantically", i.e. by looking for the presence of -1 or +0 in the model formula) and modifies the R^2 calculation accordingly. If you do something tricky, you circumvent its attempt to do so.

More specifically:

  • summary.lm() tests whether the terms element of the fitted model has an intercept attribute equal to 1
  • the terms element of the fitted model is constructed by a call to model.frame()
  • model.frame() calls terms.formula
  • terms.formula calls some pretty hairy internal C code (C_termsform), but you can see by trying out terms(y~x-1) vs terms(y~x+0) vs terms(y~x) that R is just checking semantically, as I suggested above; see the snippet below
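For example (terms() works on the bare formula, so no data objects named y or x need to exist):

attr(terms(y ~ x), "intercept")
[1] 1
attr(terms(y ~ x - 1), "intercept")
[1] 0
attr(terms(y ~ x + 0), "intercept")
[1] 0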

Bottom line: if you want to do something fancy with model matrices that works around R's normal conventions, you should probably just go ahead and calculate R^2 according to your own preferred formula; you can copy the relevant code from summary.lm(). If you are going this deep you might want to use lm.fit() instead, for computational efficiency.
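A minimal sketch of that route, reusing the objects from the question (and assuming the centered definition is the one you want):

X   <- model.matrix(lm1_)       # any model matrix you happen to be carrying around
fit <- lm.fit(X, y_)            # bare-bones fitter: no formula/terms machinery
rss <- sum(fit$residuals^2)
tss <- sum((y_ - mean(y_))^2)   # centered total sum of squares
1 - rss/tss
[1] 0.8575376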

answered Nov 22 '25 by Ben Bolker

