
sklearn: use Pipeline in a RandomizedSearchCV?

I'd like to use a Pipeline inside sklearn's RandomizedSearchCV. However, as far as I can tell, only plain estimators are supported. Here's an example of what I'd like to be able to do:

# RandomizedSearchCV lived in sklearn.grid_search in older releases
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# specify parameters and distributions to sample from
param_dist = {'C': [1, 10, 100, 1000],
              'gamma': [0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

# create pipeline with a scaler
steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
pipeline = Pipeline(steps)

# do search
search = RandomizedSearchCV(pipeline,
                            param_distributions=param_dist, n_iter=50)
search.fit(X, y)

# grid_scores_ in older releases
print(search.cv_results_)

If you run it as is, you get the following error:

ValueError: Invalid parameter kernel for estimator Pipeline

Is there a good way to do this in sklearn?

asked Jan 27 '15 19:01 by lollercoaster




2 Answers

I think this is what you need (section 3).

Call pipeline.get_params().keys() and make sure the keys in your parameter grid match the names it returns.
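As a minimal sketch of that check, assuming the pipeline from the question (steps named scaler and rbf_svm), you can list the valid names and compare them with the keys you planned to search over. The names valid_keys and unknown are only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same pipeline as in the question
pipeline = Pipeline([('scaler', StandardScaler()), ('rbf_svm', SVC())])

# Every name a search may tune; parameters of a step are prefixed
# with "<step name>__", e.g. 'rbf_svm__C'
valid_keys = sorted(pipeline.get_params().keys())
print(valid_keys)

# The bare names from the question are not among them
param_dist = {'C': [1, 10], 'gamma': [0.001], 'kernel': ['rbf']}
unknown = [key for key in param_dist if key not in valid_keys]
print(unknown)
```

Any key that shows up in unknown is one the search would reject with the "Invalid parameter" error.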

answered Sep 19 '22 03:09 by dzenilee


Both RandomizedSearchCV and GridSearchCV support pipelines. In fact, they don't depend on how the estimator is implemented, and a pipeline is designed to behave like any ordinary classifier.

The key to the issue becomes clear once you ask which parameters the search should run over. A pipeline consists of several objects (one or more transformers plus a classifier), and you may want to tune parameters of the classifier as well as of the transformers. So you need a way to say which object each parameter belongs to.

That means you don't search over some abstract gamma (the pipeline itself has no such parameter) but over the gamma of the pipeline's classifier, which in your case is named rbf_svm (this is also why the steps need names). You do this with the double-underscore syntax that sklearn uses throughout for nested models:

param_dist = {
    'rbf_svm__C': [1, 10, 100, 1000],
    'rbf_svm__gamma': [0.001, 0.0001],
    'rbf_svm__kernel': ['rbf', 'linear'],
}
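
Put together with the rest of the question's code, a sketch of the whole search might look like the following. It assumes a recent scikit-learn, where RandomizedSearchCV is imported from sklearn.model_selection rather than sklearn.grid_search. The values n_iter=5, cv=3 and random_state=0 are arbitrary choices to keep the run short and repeatable; they are not from the question.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Step names become the prefixes used in the parameter keys
pipeline = Pipeline([('scaler', StandardScaler()), ('rbf_svm', SVC())])

# "<step name>__<parameter>" routes each value to the SVC step
param_dist = {
    'rbf_svm__C': [1, 10, 100, 1000],
    'rbf_svm__gamma': [0.001, 0.0001],
    'rbf_svm__kernel': ['rbf', 'linear'],
}

# n_iter, cv and random_state are illustrative assumptions
search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the scaler is part of the pipeline, it is refit on the training portion of each cross-validation split, so the held-out fold does not leak into the scaling.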

answered Sep 20 '22 03:09 by Artem Sobolev