How does LassoCV in scikit-learn partition data?

I am performing linear regression using the Lasso method in sklearn.

According to their guidance, and advice I have seen elsewhere, instead of simply conducting cross validation on all of the training data it is recommended to split it into more traditional training set / validation set partitions.

The Lasso is thus trained on the training set, and the hyperparameter alpha is then tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns out here is a preventative measure against overfitting.
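For concreteness, the protocol I mean looks roughly like this (the alpha grid and split sizes here are arbitrary, illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] + 0.1 * rng.randn(100)

# hold out a test set for the final, realistic performance estimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0]:
    # tune alpha by cross-validation on the training portion only
    score = cross_val_score(Lasso(alpha=alpha), X_train, y_train, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

final_model = Lasso(alpha=best_alpha).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)  # evaluated once, on untouched data
```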

Actual Question

Does LassoCV conform to the above protocol, or does it just somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?

Thanks.

Asked Jun 15 '14 by Sirrah

People also ask

How does cross_val_score work?

cross_val_score splits the data into, say, 5 folds. For each fold it fits the model on the other 4 folds and scores the held-out fold. It then gives you the 5 scores, from which you can calculate a mean and variance for the score. You cross-validate to tune parameters and get an estimate of the score.
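A minimal sketch of that behaviour, on synthetic data with 5 folds:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X[:, 0] + 0.1 * rng.randn(50)

# each of the 5 folds is scored once while the model is fit on the other 4
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5)
mean_score, score_std = scores.mean(), scores.std()
```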

What is the difference between Lasso and LassoCV?

The difference between the two is that Lasso expects you to set the penalty and LassoCV performs a grid search using cross-validated MSE (CV-MSE) to find an optimal choice of the regularization strength.
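For instance (the alpha value passed to plain Lasso is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(60)

lasso = Lasso(alpha=0.1).fit(X, y)   # you pick the penalty yourself
lasso_cv = LassoCV(cv=5).fit(X, y)   # penalty chosen by CV-MSE over a grid
chosen_alpha = lasso_cv.alpha_       # the selected regularization strength
```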

What is LassoCV in Python?

sklearn.linear_model.LassoCV is used as the Lasso regression cross-validation implementation. LassoCV takes one of its parameter inputs as "cv", which represents the number of folds to be considered while applying cross-validation. In the example below, the value of cv is set to 5.
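A sketch matching that description, on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(42)
X = rng.randn(40, 8)
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(40)

model = LassoCV(cv=5).fit(X, y)  # 5-fold CV run internally to pick alpha
# model.mse_path_ holds the grid of CV mean squared errors: one column per fold
n_folds = model.mse_path_.shape[1]
```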


1 Answer

If you use sklearn.model_selection.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score  # formerly sklearn.cross_validation
from sklearn.linear_model import LassoCV

X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(n_splits=5)  # outer splitting, used for evaluation
lasso = LassoCV(cv=3)  # cv=3 makes a KFold inner splitting with 3 folds

scores = cross_val_score(lasso, X, y, cv=cv_outer)  # one score per outer fold

Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. This is at the same time the reasonable way of implementing such objects, since we may also be interested in only fitting a hyperparameter-optimized LassoCV without necessarily evaluating it directly on another set of held-out data.
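That is, fitting a LassoCV by itself tunes alpha and refits the final model, but produces no held-out score; a quick sketch:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(0)
X = rng.randn(30, 6)
y = X[:, 0] + 0.1 * rng.randn(30)

model = LassoCV(cv=3).fit(X, y)  # alpha tuned internally, model refit on all of X
predictions = model.predict(X)   # usable immediately; no unbiased score is produced
```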

Answered Sep 27 '22 by eickenberg