
Why does calculating MSE in lasso regression give different outputs?


I am trying to run different regression models on the Prostate cancer data from the lasso2 package. When I use the lasso, I have seen two different methods to calculate the mean squared error, but they give me quite different results, so I would like to know whether I'm doing something wrong or whether it just means that one method is better than the other.

# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset
data(Prostate)

# Defines the Mean Square Error function 
mse = function(x,y) { mean((x-y)^2)}

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set
train = Prostate[train_ind, ]

# Test set
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa

# Fitting a linear model by Lasso regression on the "train" data set
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse',alpha=1)
lambda.lasso = pr.lasso$lambda.min

# Getting predictions on the "test" data set and calculating the mean square error
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest) 

# Calculating MSE via the mse function defined above
mse.1 = mse(lasso.pred,ytest)
cat("MSE (method 1): ", mse.1, "\n")

# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")

So these are the outputs I got for both MSE:

MSE (method 1): 0.4609978 
MSE (method 2): 0.5654089 

And they're quite different. Does anyone know why? Thanks a lot in advance for your help!

Samuel

asked Sep 14 '16 by chogall

People also ask

How do you interpret MSE in linear regression?

The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the "errors"), squaring them, and averaging the squares. The squaring is necessary to remove any negative signs.

What does MSE tell you in regression?

The mean squared error measures how close a regression line is to a set of data points. It is a risk function corresponding to the expected value of the squared error loss. It is calculated by taking the average (the mean) of the squared errors between the data and the fitted function.

Is it better to have a higher or lower MSE?

There is no single "correct" value for MSE. Simply put, the lower the value the better, and 0 means the model is perfect.

Why do we use MSE for regression?

MSE is used to check how close estimates or forecasts are to actual values: the lower the MSE, the closer the forecast is to the actual values. It is used as a model evaluation measure for regression models, where a lower value indicates a better fit.
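
For concreteness, here is a minimal R sketch (not from the original post) of the squaring-and-averaging described above, using made-up numbers:

# Toy example (hypothetical values, for illustration only)
y    = c(1.0, 2.0, 3.0, 4.0)   # observed values
yhat = c(1.1, 1.9, 3.2, 3.8)   # predictions from some regression line
mean((y - yhat)^2)             # average of the squared residuals: 0.025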


1 Answer

As pointed out by @alistaire, in the first case you are using the test data to compute the MSE, while in the second case the MSE from the cross-validation (training) folds is reported, so it's not an apples-to-apples comparison.

We can do something like the following to make an apples-to-apples comparison (by keeping the fitted values on the training folds), and as we can see, mse.1 and mse.2 are exactly equal when computed on the same training folds (although the value is a little different from yours; this was run on my desktop with R version 3.1.2, x86_64-w64-mingw32, Windows 10):

# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset
data(Prostate)

# Defines the Mean Square Error function 
mse = function(x,y) { mean((x-y)^2)}

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set
train = Prostate[train_ind, ]

# Test set
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa

# Fitting a linear model by Lasso regression on the "train" data set
# keep the fitted values on the training folds
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse', keep=TRUE, alpha=1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)

# get the predicted values on the training folds with lambda.min (not from test data)
mse.1 = mse(pr.lasso$fit.preval[,lambda.id], ytrain)  # fit.preval holds the cross-validated predictions when keep=TRUE
cat("MSE (method 1): ", mse.1, "\n")

MSE (method 1):  0.6044496 

# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")

MSE (method 2):  0.6044496 

mse.1 == mse.2
[1] TRUE
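
As a follow-up sketch (not part of the original answer): because keep=TRUE stores the cross-validated predictions in fit.preval, the whole cvm curve can be recomputed by hand for type.measure='mse', which makes it clear where method 2's number comes from. This assumes the default unit observation weights, so cvm is just the plain mean of the squared prevalidated residuals at each lambda.

# Recompute the CV mean squared error at every lambda from the prevalidated fits
cv.mse.byhand = apply(pr.lasso$fit.preval, 2, function(p) mean((p - ytrain)^2))
cv.mse.byhand[lambda.id]   # should reproduce mse.2 (and hence mse.1) above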
answered Sep 23 '22 by Sandipan Dey