I'm new to sklearn's Pipeline and GridSearchCV features. I am trying to build a pipeline which first does RandomizedPCA on my training data and then fits a ridge regression model. Here is my code:
import numpy as np
from sklearn.decomposition import RandomizedPCA
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pca = RandomizedPCA(1000, whiten=True)
rgn = Ridge()
pca_ridge = Pipeline([('pca', pca),
                      ('ridge', rgn)])
parameters = {'ridge__alpha': 10 ** np.linspace(-5, -2, 3)}
grid_search = GridSearchCV(pca_ridge, parameters, cv=2, n_jobs=1,
                           scoring='mean_squared_error')
grid_search.fit(train_x, train_y[:, 1:])
I know about the RidgeCV function but I want to try out Pipeline and GridSearchCV.
I want the grid search CV to report RMSE, but that doesn't seem to be supported in sklearn, so I'm making do with MSE. However, the scores it reports are negative:
In [41]: grid_search.grid_scores_
Out[41]:
[mean: -0.02665, std: 0.00007, params: {'ridge__alpha': 1.0000000000000001e-05},
mean: -0.02658, std: 0.00009, params: {'ridge__alpha': 0.031622776601683791},
mean: -0.02626, std: 0.00008, params: {'ridge__alpha': 100.0}]
Obviously this isn't possible for mean squared error - what am I doing wrong here?
GridSearchCV tries every combination of the values passed in the parameter dictionary and evaluates the model for each combination using cross-validation. After fitting, we get a score for every combination of hyperparameters and can pick the one with the best performance.
The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. To support this, it lets you set the parameters of each step using the step name and the parameter name separated by '__', as in the 'ridge__alpha' key above.
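For example, here is a minimal sketch (using StandardScaler in place of the question's PCA step, purely for illustration) of how to discover which step__parameter keys a pipeline exposes:

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])

# Every step's parameters appear as '<step name>__<parameter name>'
print(sorted(pipe.get_params().keys()))  # includes 'ridge__alpha'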
Pipeline requires you to name the steps manually; the names are defined explicitly, with no rules. make_pipeline names the steps automatically, using a straightforward rule: the lowercased class name of each estimator.
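A quick sketch of the difference:

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline: you pick the step names yourself
explicit = Pipeline([('scale', StandardScaler()), ('model', Ridge())])

# make_pipeline: names come from the lowercased class names
auto = make_pipeline(StandardScaler(), Ridge())
print([name for name, _ in auto.steps])  # ['standardscaler', 'ridge']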
Those scores are negated MSE scores, i.e. negate them and you get the MSE. The thing is that GridSearchCV, by convention, always tries to maximize its score, so loss functions like MSE have to be negated.
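So to get the RMSE the question asks about, flip the sign and take the square root. A minimal sketch against the question's grid_scores_ attribute (the old, pre-0.18 API; newer sklearn versions expose cv_results_ instead):

import numpy as np

# Each entry holds (parameters, mean validation score, per-fold scores);
# the stored score is -MSE, so negate it before taking the root.
for params, mean_score, cv_scores in grid_search.grid_scores_:
    print(params, 'RMSE:', np.sqrt(-mean_score))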
An alternative is to build the scorer yourself with make_scorer, setting the greater_is_better flag to False. So, if clf is your classifier and parameters is your hyperparameter grid, you can use make_scorer like this:
from sklearn.metrics import make_scorer, mean_squared_error

# Define your own MSE scorer; greater_is_better=False marks it as a loss
mse = make_scorer(mean_squared_error, greater_is_better=False)
Now, just as before, you can call GridSearchCV and pass in your defined mse scorer:
grid_obj = GridSearchCV(clf, parameters, cv=5, scoring=mse, n_jobs=-1, verbose=True)
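Fitting then works as usual. A short usage sketch (clf, parameters, X and y are assumed to be defined already):

grid_obj.fit(X, y)

# make_scorer with greater_is_better=False stores the negated loss,
# so best_score_ is negative; negate it to read it as a plain MSE.
print(grid_obj.best_params_)
print('best MSE:', -grid_obj.best_score_)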