Is there a better inbuilt way to do grid search and test multiple models in a single pipeline? Of course the parameters of the models would be different, which made it complicated for me to figure this out. Here is what I did:
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed

def grid_search(X_train, y_train):
    # the transformer has to come before the classifier in the pipeline
    pipeline1 = Pipeline([
        ('vec2', TfidfTransformer()),
        ('clf', RandomForestClassifier()),
    ])
    pipeline2 = Pipeline([('clf', KNeighborsClassifier())])
    pipeline3 = Pipeline([('clf', SVC())])
    pipeline4 = Pipeline([('clf', MultinomialNB())])

    parameters1 = {
        'clf__n_estimators': [10, 20, 30],
        'clf__criterion': ['gini', 'entropy'],
        'clf__max_features': ['log2', 'sqrt', None],  # string options belong here...
        'clf__max_depth': [5, 10, 15],                # ...and integer depths here
    }
    parameters2 = {
        'clf__n_neighbors': [3, 7, 10],
        'clf__weights': ['uniform', 'distance'],
    }
    parameters3 = {
        'clf__C': [0.01, 0.1, 1.0],
        'clf__kernel': ['rbf', 'poly'],
        'clf__gamma': [0.01, 0.1, 1.0],
    }
    parameters4 = {
        'clf__alpha': [0.01, 0.1, 1.0],
    }

    pars = [parameters1, parameters2, parameters3, parameters4]
    pips = [pipeline1, pipeline2, pipeline3, pipeline4]

    print("starting Gridsearch")
    for i in range(len(pars)):
        gs = GridSearchCV(pips[i], pars[i], verbose=2, refit=False, n_jobs=-1)
        gs = gs.fit(X_train, y_train)
        print("finished Gridsearch")
        print(gs.best_score_)
```
However, this approach still gives the best model within each classifier; it does not compare across classifiers.
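One built-in way to get a single comparison across classifiers is to make the estimator itself a searchable parameter: `GridSearchCV` accepts a list of parameter grids, and each grid can set the pipeline's `clf` step to a different estimator. A minimal sketch (the step name `clf` and the iris data are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A single pipeline; the 'clf' step is a placeholder that each grid overrides.
pipe = Pipeline([('clf', SVC())])

# A list of grids: each dict swaps in a different estimator plus its own parameters.
param_grid = [
    {'clf': [SVC()], 'clf__C': [0.1, 1.0], 'clf__kernel': ['rbf', 'poly']},
    {'clf': [RandomForestClassifier()], 'clf__n_estimators': [10, 20, 30]},
    {'clf': [GaussianNB()]},
]

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

# best_params_/best_score_ now reflect the winner across ALL grids,
# not the best model within each classifier separately
print(grid.best_params_)
print(grid.best_score_)
```

Because all candidates run inside one `GridSearchCV`, the final `best_estimator_` is chosen between classifiers, not merely within each.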
Random search is a technique where random combinations of the hyperparameters are tried in order to find the best model. It is similar to grid search, yet it has often been shown to yield comparable or better results at lower cost. Its drawback is higher variance: because the combinations are sampled, repeated runs can land on different settings.
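As a sketch of the idea (the random-forest parameters here are just illustrative choices), `RandomizedSearchCV` samples a fixed number of combinations from the search space instead of trying them all:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions to sample from; randint draws integers uniformly from [low, high)
param_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': [3, 5, None],
}

# n_iter controls the budget: only 5 random combinations are evaluated
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Fixing `random_state` is what tames the variance mentioned above: without it, each run may pick a different winner.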
Cross-validation is a method for robustly estimating test-set performance (generalization) of a model. Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
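To make the distinction concrete: `cross_val_score` estimates the generalization performance of one fixed model, while `GridSearchCV` runs that same cross-validation internally for every candidate in the grid and keeps the best. A minimal sketch of the first half:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation alone: estimate test-set performance of ONE fixed model
scores = cross_val_score(SVC(C=1.0), X, y, cv=5)
print(scores.mean())  # average accuracy over the 5 folds
```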
The main difference between the two approaches is that in grid search we define the combinations and train a model for each, whereas RandomizedSearchCV samples the combinations randomly. Both are effective ways of tuning hyperparameters to improve model generalization.
One of the most important and generally-used methods for performing hyperparameter tuning is called the exhaustive grid search. This is a brute-force approach because it tries all of the combinations of hyperparameters from a grid of parameter values.
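The size of that brute-force search is simply the product of the lengths of the grid's axes; scikit-learn's `ParameterGrid` makes the enumeration explicit:

```python
from sklearn.model_selection import ParameterGrid

grid = {'C': [0.01, 0.1, 1.0], 'kernel': ['rbf', 'poly']}

# Exhaustive grid search evaluates every combination: 3 * 2 = 6 candidates
combos = list(ParameterGrid(grid))
print(len(combos))  # 6
```

This is why exhaustive search gets expensive quickly: adding one more axis multiplies, rather than adds to, the number of models trained.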
Although the solution from dubek is more straightforward, it does not help with interactions between parameters of pipeline elements that come before the classifier. Therefore, I have written a helper class to deal with it, which can be included in the default Pipeline setting of scikit. A minimal example:
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from pipelinehelper import PipelineHelper

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

pipe = Pipeline([
    ('scaler', PipelineHelper([
        ('std', StandardScaler()),
        ('max', MaxAbsScaler()),
    ])),
    ('classifier', PipelineHelper([
        ('svm', LinearSVC()),
        ('rf', RandomForestClassifier()),
    ])),
])

params = {
    'scaler__selected_model': pipe.named_steps['scaler'].generate({
        'std__with_mean': [True, False],
        'std__with_std': [True, False],
        'max__copy': [True],  # just for displaying
    }),
    'classifier__selected_model': pipe.named_steps['classifier'].generate({
        'svm__C': [0.1, 1.0],
        'rf__n_estimators': [100, 20],
    }),
}

grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=1)
grid.fit(X_iris, y_iris)
print(grid.best_params_)
print(grid.best_score_)
```
It can also be used for other elements of the pipeline, not just the classifier. Code is on github if anyone wants to check it out.
Edit: I have published this on PyPI if anyone is interested; just install it using pip install pipelinehelper.