I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and advice I have seen elsewhere, instead of simply running cross validation on all of the training data it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and the hyperparameter alpha is tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns out here is a preventative measure against overfitting.
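Roughly, this is the kind of split I mean (a sketch; the 60/20/20 proportions are just an illustrative choice):

from sklearn.model_selection import train_test_split

# X, y: the full dataset
# Hold out a final test set first, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25)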
Actual Question
Does LassoCV conform to the above protocol, or does it somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
"cross_val_score" splits the data into say 5 folds. Then for each fold it fits the data on 4 folds and scores the 5th fold. Then it gives you the 5 scores from which you can calculate a mean and variance for the score. You crossval to tune parameters and get an estimate of the score.
The difference between Lasso and LassoCV is that Lasso expects you to set the penalty yourself, whereas LassoCV performs a grid search over the regularization strength, using cross-validated MSE (CV-MSE) to find an optimal choice.
sklearn.linear_model.LassoCV is the cross-validated implementation of Lasso regression. LassoCV takes a parameter cv, which specifies the number of folds to use when applying cross-validation. In the example below, cv is set to 5.
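A minimal sketch of such an example (the generated data is a placeholder):

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)

# cv=5: LassoCV runs 5-fold cross-validation internally over a grid of alphas
reg = LassoCV(cv=5).fit(X, y)
print(reg.alpha_)  # the regularization strength chosen by cross-validated MSE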
If you use sklearn.model_selection.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another split of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LassoCV

# Toy data: 20 samples, 10 random features
X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(n_splits=5)  # outer loop: scores the tuned model on held-out folds
lasso = LassoCV(cv=3)  # cv=3 makes a KFold inner splitting with 3 folds to choose alpha
scores = cross_val_score(lasso, X, y, cv=cv_outer)  # one score per outer fold
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. At the same time, this is a reasonable way of implementing such objects, since one can also be interested in only fitting a hyperparameter-optimized LassoCV without necessarily evaluating it directly on another set of held-out data.
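Continuing the snippet above, using the LassoCV on its own would look like this (just a sketch, reusing the X, y, and lasso defined there):

# Fit on all the data: alpha is tuned via the internal 3-fold CV,
# but no outer held-out score is produced
lasso.fit(X, y)
print(lasso.alpha_)  # the selected penalty
print(lasso.coef_)   # coefficients at that penalty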