sklearn GridSearchCV with Pipeline

I'm new to sklearn's Pipeline and GridSearchCV features. I am trying to build a pipeline which first does RandomizedPCA on my training data and then fits a ridge regression model. Here is my code:

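# Imports for the scikit-learn releases this code targets (before 0.20)
import numpy as np
from sklearn.decomposition import RandomizedPCA
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pca = RandomizedPCA(1000, whiten=True)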
rgn = Ridge()

pca_ridge = Pipeline([('pca', pca),
                      ('ridge', rgn)])

# alpha grid: 1e-5, ~0.0316 and 100, matching the values in the output below
parameters = {'ridge__alpha': 10 ** np.linspace(-5, 2, 3)}

grid_search = GridSearchCV(pca_ridge, parameters, cv=2, n_jobs=1, scoring='mean_squared_error')
grid_search.fit(train_x, train_y[:, 1:])

I know about RidgeCV, but I want to try out Pipeline and GridSearchCV.

I want the grid search to report RMSE, but sklearn doesn't seem to support this, so I'm making do with MSE. However, the scores it reports are negative:

In [41]: grid_search.grid_scores_
Out[41]: 
[mean: -0.02665, std: 0.00007, params: {'ridge__alpha': 1.0000000000000001e-05},
 mean: -0.02658, std: 0.00009, params: {'ridge__alpha': 0.031622776601683791},
 mean: -0.02626, std: 0.00008, params: {'ridge__alpha': 100.0}]

Obviously this isn't possible for mean squared error - what am I doing wrong here?

mchangun asked Jan 10 '14 16:01




2 Answers

Those are negated MSE scores: flip the sign and you get the MSE. GridSearchCV, by convention, always maximizes its score, so loss functions like MSE have to be negated.
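As a small illustration of that sign convention, here is a sketch that takes the three mean scores printed in the question and recovers MSE and RMSE from them. It uses plain Python only; the variable names are made up for the example and nothing here calls sklearn.

```python
import math

# Mean scores copied from the grid_scores_ output in the question,
# keyed by the ridge__alpha value they belong to.
scores = {
    1.0000000000000001e-05: -0.02665,
    0.031622776601683791: -0.02658,
    100.0: -0.02626,
}

# Negating a score gives the mean squared error back.
mse = {alpha: -score for alpha, score in scores.items()}

# RMSE is the square root of the (positive) MSE.
rmse = {alpha: math.sqrt(value) for alpha, value in mse.items()}

# The search maximizes the score, which is the same as minimizing the MSE.
best_alpha = max(scores, key=scores.get)

for alpha in scores:
    print(alpha, mse[alpha], rmse[alpha])
print("best alpha:", best_alpha)
```

The largest (least negative) score belongs to the candidate with the smallest error, so picking the maximum score and picking the minimum MSE select the same alpha.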

Fred Foo answered Oct 13 '22 12:10



An alternative is to build the scorer yourself with make_scorer and set its greater_is_better flag to False.

So, if clf is your estimator and parameters is your hyperparameter grid, you can use make_scorer like this:

from sklearn.metrics import make_scorer, mean_squared_error
# wrap MSE as a scorer; greater_is_better=False makes the scorer negate it
mse = make_scorer(mean_squared_error, greater_is_better=False)

Now, as shown below, you can create the GridSearchCV and pass it the mse scorer you defined:

grid_obj = GridSearchCV(clf, parameters, cv=5, scoring=mse, n_jobs=-1, verbose=True)
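Since the question asked for RMSE, the same approach can be sketched with a custom RMSE function. This is only an illustration under some assumptions: it targets a recent scikit-learn release, where GridSearchCV lives in sklearn.model_selection and PCA stands in for the old RandomizedPCA; the rmse helper, the synthetic data and the component count are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def rmse(y_true, y_pred):
    """Hypothetical helper: root mean squared error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


# greater_is_better=False makes the scorer return -rmse, so that
# maximizing the score minimizes the error.
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Small synthetic regression problem, purely for illustration.
rng = np.random.RandomState(0)
X = rng.randn(60, 8)
y = X @ rng.randn(8) + 0.1 * rng.randn(60)

pipe = Pipeline([("pca", PCA(n_components=4)), ("ridge", Ridge())])
params = {"ridge__alpha": 10.0 ** np.linspace(-5, 2, 3)}

search = GridSearchCV(pipe, params, cv=2, scoring=rmse_scorer)
search.fit(X, y)

# Scores come back negated; flip the sign to read them as RMSE.
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

The reported scores are still negative here, for the same reason as with MSE; negating best_score_ or the entries of cv_results_["mean_test_score"] gives the RMSE values.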
Espanta answered Oct 13 '22 12:10
