
R-squared on test data

I fit a linear regression model on 75% of my data set that includes ~11000 observations and 143 variables:

gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data= x[1:ceiling(length(y)*(3/4)),]) #3/4 for training

and I got an R^2 of 0.43. I then tried predicting on my test data using the rest of the data:

ytest=y[(ceiling(length(y)*(3/4))+1):length(y)]
x.test <- cbind(1,x[(ceiling(length(y)*(3/4))+1):length(y),]) #The rest for test
yhat <- as.matrix(x.test)%*%gl.fit$coefficients  #Calculate the predicted values

I now would like to calculate the R^2 value on my test data. Is there any easy way to calculate that?

Thank you

H_A asked Sep 05 '14 17:09



1 Answer

Calculating R-squared on test data is a little tricky, because you have to remember what your baseline is. The baseline prediction is the mean of the training data, not the mean of the test data.

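Therefore, extending the example provided by @jlhoward (here df is the data frame, train indexes the training rows, and test.y and test.pred are the observed and predicted test responses):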

SS.test.total      <- sum((test.y - mean(df[train,]$y))^2)
SS.test.residual   <- sum((test.y - test.pred)^2)
SS.test.regression <- sum((test.pred - mean(df[train,]$y))^2)
# out of sample, total SS does not split exactly into regression SS + residual SS
SS.test.total - (SS.test.regression+SS.test.residual)
# [1] 11617720 not 8958890

test.rsq <- 1 - SS.test.residual/SS.test.total  
test.rsq
# [1] 0.09284556 not 0.0924713

# fraction of variability explained by the model
SS.test.regression/SS.test.total 
# [1] 0.08907705 not 0.08956405
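The calculation above can be wrapped into a small helper. This is only a minimal sketch on simulated data: the function name test_rsq, the columns x1 and x2, and the data-generating formula are all made up for illustration and are not part of the original example. It also uses predict() instead of multiplying the design matrix by the coefficients by hand, which avoids having to add the intercept column yourself.

```r
# Hypothetical helper: out-of-sample R-squared with the TRAINING mean as baseline
test_rsq <- function(y_test, y_pred, y_train) {
  baseline <- mean(y_train)               # baseline prediction comes from the training data
  ss_res   <- sum((y_test - y_pred)^2)    # residual sum of squares on the test set
  ss_tot   <- sum((y_test - baseline)^2)  # total sum of squares around the training mean
  1 - ss_res / ss_tot
}

# Simulated data, for illustration only (not the asker's data)
set.seed(1)
n  <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(n)

train <- seq_len(ceiling(n * 3 / 4))      # first 75% of rows for training, as in the question
fit   <- lm(y ~ ., data = df[train, ])

test.y    <- df$y[-train]
test.pred <- predict(fit, newdata = df[-train, ])  # predict() adds the intercept itself

rsq_out <- test_rsq(test.y, test.pred, df$y[train])
rsq_out
```

Note that this value can be negative on real data: nothing stops a model from predicting the test set worse than the training mean does.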

Update: the miscTools::rSquared() function assumes that R-squared is computed on the same dataset the model was trained on, because it calculates

yy <- y - mean(y)

behind the scenes in line 184 here: https://github.com/cran/miscTools/blob/master/R/utils.R
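To see what that centering does on a test set, here is a small sketch that reproduces both baselines by hand instead of calling miscTools. The data and the stand-in predictions are simulated and purely hypothetical. Because the mean of a sample minimizes the sum of squared deviations from a constant, the total sum of squares around the test mean can never exceed the one around the training mean, so centering on the test mean gives an R-squared that is never larger.

```r
# Simulated responses and stand-in predictions, for illustration only
set.seed(42)
y_train <- rnorm(300, mean = 10)
y_test  <- rnorm(100, mean = 10)
y_pred  <- y_test + rnorm(100, sd = 0.8)   # pretend these came from a fitted model

ss_res <- sum((y_test - y_pred)^2)

# Centering on the test mean, i.e. what y - mean(y) amounts to when y is the test response
ss_tot_test  <- sum((y_test - mean(y_test))^2)
# Centering on the training mean, i.e. the baseline argued for above
ss_tot_train <- sum((y_test - mean(y_train))^2)

rsq_test_mean  <- 1 - ss_res / ss_tot_test
rsq_train_mean <- 1 - ss_res / ss_tot_train

# The two totals differ by exactly n * (mean(y_test) - mean(y_train))^2
gap <- length(y_test) * (mean(y_test) - mean(y_train))^2
c(test_mean = rsq_test_mean, train_mean = rsq_train_mean, ss_gap = gap)
```

The gap shrinks as the training and test means get closer, which is why the two versions in the answer above differ only in the third decimal place.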

dmi3kno answered Sep 28 '22 05:09
