Given two simple sets of data:
head(training_set)
x y
1 1 2.167512
2 2 4.684017
3 3 3.702477
4 4 9.417312
5 5 9.424831
6 6 13.090983
head(test_set)
x y
1 1 2.068663
2 2 4.162103
3 3 5.080583
4 4 8.366680
5 5 8.344651
I want to fit a linear regression line on the training data, then use that fitted line (i.e. its coefficients) to calculate the "test MSE", the mean squared error of the residuals, on the test data.
model = lm(y ~ x, data = training_set)
train_MSE = mean(model$residuals^2)
test_MSE = ?
To calculate the MSE, we square the residuals so that a negative residual contributes to the mean just as much as a positive residual of the same magnitude. For example, say the actual value of the response variable was 5, and we predicted 7. The residual is 5 − 7 = −2, which contributes (−2)² = 4 to the mean, the same as a residual of +2.
To find the MSE, take the observed value, subtract the predicted value, and square that difference. Repeat that for all observations. Then, sum all of those squared values and divide by the number of observations.
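In symbols, this is the standard definition: for n observations with actual values y_i and predicted values ŷ_i,

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$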
Training Error versus Test Error

A smaller MSE means that the estimate is more accurate. It is important to realise that this MSE value is computed using only the training data, i.e. only the data the model was fitted on. Hence it is known as the training MSE.
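The train_MSE line in the question computes exactly this quantity; an equivalent form using fitted() makes the "observed minus predicted" structure explicit (a sketch reusing the model and training_set objects already defined above):

# Same value as mean(model$residuals^2): observed minus fitted, squared, averaged
train_MSE <- mean((training_set$y - fitted(model))^2)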
When the same quantity is computed on held-out test data, it is more precise to call it the MSPE (mean squared prediction error):
mean((test_set$y - predict.lm(model, test_set)) ^ 2)
This is a more useful measure when the goal is prediction: we want the model with minimal MSPE.
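Broken into steps, the one-liner above reads as follows (a sketch reusing model and test_set from the question; the generic predict() dispatches to predict.lm() for an lm object):

# Predict the test responses from the model fit on the training data
predictions <- predict(model, newdata = test_set)

# Prediction errors: observed minus predicted
test_residuals <- test_set$y - predictions

# Mean squared prediction error (the "test MSE" asked for)
test_MSE <- mean(test_residuals^2)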
In practice, if we have a spare test data set, we can compute the MSPE directly, as above. Very often, however, we don't have spare data. In that case, leave-one-out cross-validation (LOOCV) gives an estimate of the MSPE from the training data alone.
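For a linear model, the LOOCV estimate has a closed form based on the hat matrix, so no explicit refitting loop is needed (a standard identity for lm fits, not something from the question itself):

# Leave-one-out CV error for an lm fit: scale each residual by 1/(1 - leverage)
loocv_MSE <- mean((residuals(model) / (1 - hatvalues(model)))^2)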
There are also several other statistics for assessing prediction error, such as Mallows's Cp statistic and the AIC.
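AIC, for example, comes straight from the fitted object in R (Mallows's Cp is usually computed when comparing several candidate models, e.g. with the leaps package):

# Akaike information criterion for the fitted model; lower is better
AIC(model)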