I fit a linear regression model on 75% of my data set that includes ~11000 observations and 143 variables:
gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data = x[1:ceiling(length(y)*(3/4)), ]) # first 3/4 for training
and I got an R^2 of 0.43. I then predicted on the remaining 25% of the data, which I use as test data:
ytest <- y[(ceiling(length(y)*(3/4))+1):length(y)]
x.test <- cbind(1, x[(ceiling(length(y)*(3/4))+1):length(y), ]) # the rest for testing; the column of 1s is for the intercept
yhat <- as.matrix(x.test) %*% gl.fit$coefficients # calculate the predicted values
I would now like to calculate the R^2 value on my test data. Is there an easy way to do that?
Thank you
Calculating R-squared on the test data is a little tricky, because you have to remember what your baseline is. The baseline prediction is the mean of y in your training data, not the mean of y in the test data.
Therefore, extending the example provided by @jlhoward above:
SS.test.total <- sum((test.y - mean(df[train,]$y))^2)
SS.test.residual <- sum((test.y - test.pred)^2)
SS.test.regression <- sum((test.pred - mean(df[train,]$y))^2)
SS.test.total - (SS.test.regression+SS.test.residual)
# [1] 11617720 not 8958890
test.rsq <- 1 - SS.test.residual/SS.test.total
test.rsq
# [1] 0.09284556 not 0.0924713
# fraction of variability explained by the model
SS.test.regression/SS.test.total
# [1] 0.08907705 not 0.08956405
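For readers who do not have the earlier example at hand, here is a minimal self-contained sketch of the same calculation. The data are simulated and the names (df, train, fit, baseline) are assumptions made for illustration, not the asker's actual objects. It uses predict() instead of multiplying the design matrix by the coefficients by hand.

```r
set.seed(1)                                   # simulated data, for illustration only
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(n)

train <- seq_len(ceiling(n * 3 / 4))          # first 75% of rows for training
fit   <- lm(y ~ ., data = df[train, ])        # fit on the training rows only

test.y    <- df$y[-train]                     # held-out responses
test.pred <- predict(fit, newdata = df[-train, ])

baseline <- mean(df$y[train])                 # baseline is the TRAINING mean
SS.test.total    <- sum((test.y - baseline)^2)
SS.test.residual <- sum((test.y - test.pred)^2)
test.rsq <- 1 - SS.test.residual / SS.test.total
test.rsq
```

Note that on held-out data the total sum of squares no longer splits exactly into regression plus residual sums of squares, which is why the two ratios above give different numbers.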
Update: the miscTools::rSquared() function assumes that R-squared is calculated on the same data set the model was trained on, because it calculates
yy <- y - mean(y)
behind the scenes (line 184 of https://github.com/cran/miscTools/blob/master/R/utils.R).
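To see how much the choice of baseline can matter, here is a small sketch with a hypothetical helper, rsq_oos(), that takes the baseline as an explicit argument. The helper and the toy numbers are made up for illustration and are not part of miscTools. Centering on the mean of the y values passed in corresponds to the second call.

```r
# Hypothetical helper: out-of-sample R-squared against an explicit baseline value
rsq_oos <- function(actual, predicted, baseline) {
  1 - sum((actual - predicted)^2) / sum((actual - baseline)^2)
}

# Toy numbers, made up for illustration
train.y   <- c(1, 2, 3, 4)
test.y    <- c(4, 5, 6)
test.pred <- c(4.5, 5, 5.5)

rsq_train_base <- rsq_oos(test.y, test.pred, mean(train.y))  # training-mean baseline
rsq_test_base  <- rsq_oos(test.y, test.pred, mean(test.y))   # test-mean baseline
c(train = rsq_train_base, test = rsq_test_base)
```

Because the test mean minimizes the total sum of squares on the test set, the test-mean version can never exceed the training-mean version, so the two only agree when the training and test means coincide.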