
grid search over multiple classifiers

Is there a better built-in way to do a grid search and test multiple models in a single pipeline? Of course, the parameters of the models would differ, which made it complicated for me to figure this out. Here is what I did:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases


def grid_search():
    # The transformer has to come before the classifier in a Pipeline.
    pipeline1 = Pipeline([
        ('vec2', TfidfTransformer()),
        ('clf', RandomForestClassifier()),
    ])

    pipeline2 = Pipeline([
        ('clf', KNeighborsClassifier()),
    ])

    pipeline3 = Pipeline([
        ('clf', SVC()),
    ])

    pipeline4 = Pipeline([
        ('clf', MultinomialNB()),
    ])

    parameters1 = {
        'clf__n_estimators': [10, 20, 30],
        'clf__criterion': ['gini', 'entropy'],
        'clf__max_features': [5, 10, 15],
        'clf__max_depth': [5, 10, None],  # max_depth takes an int or None
    }

    parameters2 = {
        'clf__n_neighbors': [3, 7, 10],
        'clf__weights': ['uniform', 'distance'],
    }

    parameters3 = {
        'clf__C': [0.01, 0.1, 1.0],
        'clf__kernel': ['rbf', 'poly'],
        'clf__gamma': [0.01, 0.1, 1.0],
    }

    parameters4 = {
        'clf__alpha': [0.01, 0.1, 1.0],
    }

    pars = [parameters1, parameters2, parameters3, parameters4]
    pips = [pipeline1, pipeline2, pipeline3, pipeline4]

    print("starting Gridsearch")
    for i in range(len(pars)):
        gs = GridSearchCV(pips[i], pars[i], verbose=2, refit=False, n_jobs=-1)
        gs = gs.fit(X_train, y_train)  # X_train / y_train defined elsewhere
        print("finished Gridsearch")
        print(gs.best_score_)

However, this approach still finds the best model within each classifier separately; it does not compare across classifiers.

asked Apr 13 '14 by Aks

People also ask

Which is better randomized search or grid search?

Random search is a technique in which random combinations of the hyperparameters are tried to find the best model. It is similar to grid search, yet it has often proven to yield better results comparatively. The drawback of random search is that its results have higher variance, since they depend on which combinations happen to be sampled.
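As a rough illustration of that trade-off, here is a minimal sketch using scikit-learn's RandomizedSearchCV; the dataset, estimator, and parameter ranges are illustrative placeholders:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Distributions (or lists) to sample from, rather than a fixed grid.
param_distributions = {
    'C': loguniform(1e-2, 1e2),
    'gamma': loguniform(1e-3, 1e1),
    'kernel': ['rbf', 'poly'],
}

# n_iter caps the number of sampled combinations, which bounds the cost
# but makes the result depend on the random draw (hence the higher variance).
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20,
                            cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)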

What is the difference between grid search and cross-validation?

Cross-validation is a method for robustly estimating test-set performance (generalization) of a model. Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
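A minimal sketch of that distinction, with illustrative values: cross-validation alone estimates how well one fixed model generalizes, while grid search runs that same cross-validation for every candidate and keeps the best.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: estimate generalization of a single, fixed model.
scores = cross_val_score(SVC(C=1.0), X, y, cv=5)
print(scores.mean())

# Grid search: run the same 5-fold CV for each candidate, then select the best.
grid = GridSearchCV(SVC(), {'C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)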

What is the difference between grid search CV and RandomizedSearchCV?

The main difference between the two approaches is that in grid search we define the combinations explicitly and train a model for each one, whereas RandomizedSearchCV samples the combinations randomly. Both are very effective ways of tuning hyperparameters to improve model generalizability.

What is exhaustive grid search?

One of the most important and widely used methods for performing hyperparameter tuning is the exhaustive grid search. This is a brute-force approach because it tries all of the combinations of hyperparameters from a grid of parameter values.
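To make "all of the combinations" concrete, scikit-learn's ParameterGrid enumerates exactly the candidates that GridSearchCV would try; the grid values below are illustrative:

from sklearn.model_selection import ParameterGrid

grid = {'C': [0.1, 1.0, 10.0], 'kernel': ['rbf', 'poly']}

# The exhaustive search fits and cross-validates one model per combination:
# 3 values of C x 2 kernels = 6 candidates.
for params in ParameterGrid(grid):
    print(params)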


1 Answer

Although the solution from dubek is more straightforward, it does not help with interactions between parameters of pipeline elements that come before the classifier. Therefore, I have written a helper class to deal with it, which can be included in a standard scikit-learn Pipeline. A minimal example:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from pipelinehelper import PipelineHelper

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

pipe = Pipeline([
    ('scaler', PipelineHelper([
        ('std', StandardScaler()),
        ('max', MaxAbsScaler()),
    ])),
    ('classifier', PipelineHelper([
        ('svm', LinearSVC()),
        ('rf', RandomForestClassifier()),
    ])),
])

params = {
    'scaler__selected_model': pipe.named_steps['scaler'].generate({
        'std__with_mean': [True, False],
        'std__with_std': [True, False],
        'max__copy': [True],  # just for displaying
    }),
    'classifier__selected_model': pipe.named_steps['classifier'].generate({
        'svm__C': [0.1, 1.0],
        'rf__n_estimators': [100, 20],
    }),
}

grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=1)
grid.fit(X_iris, y_iris)
print(grid.best_params_)
print(grid.best_score_)

It can also be used for other elements of the pipeline, not just the classifier. The code is on GitHub if anyone wants to check it out.
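For comparison, here is a minimal sketch of the more straightforward single-GridSearchCV approach referred to above, assuming the common scikit-learn pattern of listing the pipeline step itself as a searchable parameter (the step names and values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC()),  # placeholder; replaced by each grid below
])

# A list of grids: each dict swaps a different estimator into the 'clf'
# step and searches only the parameters that apply to it.
param_grid = [
    {
        'clf': [SVC()],
        'clf__C': [0.1, 1.0, 10.0],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [50, 100],
    },
]

grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)
grid.fit(X, y)
print(grid.best_params_)  # includes the winning estimator itself
print(grid.best_score_)

Because every candidate goes through the same cross-validated search, best_score_ is directly comparable across the different classifiers, which is exactly what the per-classifier loops in the question cannot provide.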

Edit: I have published this on PyPI if anyone is interested; just install it using pip install pipelinehelper.

answered Sep 23 '22 by bmurauer