
Cross validation with grid search returns worse results than default

I'm using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the "best" parameters for different techniques, yet many of these perform worse than the defaults. I include the default parameters as an option, so I'm surprised this would happen.

For example:

from sklearn import svm, grid_search
from sklearn.ensemble import GradientBoostingClassifier
from time import time  # needed for the timing calls below
gbc = GradientBoostingClassifier(verbose=1)
parameters = {'learning_rate':[0.01, 0.05, 0.1, 0.5, 1],  
              'min_samples_split':[2,5,10,20], 
              'max_depth':[2,3,5,10]}
clf = grid_search.GridSearchCV(gbc, parameters)
t0 = time()
clf.fit(X_crossval, labels)
print "Gridsearch time:", round(time() - t0, 3), "s"
print clf.best_params_
# The output is: {'min_samples_split': 2, 'learning_rate': 0.01, 'max_depth': 2}

This is nearly the same as the defaults (max_depth defaults to 3 and learning_rate to 0.1). When I use these parameters, I get an accuracy of 72%, compared to 78% from the defaults.

One thing I did, that I will admit is suspicious, is that I used my entire dataset for the cross validation. Then after obtaining the parameters, I ran it using the same dataset, split into 75-25 training/testing.

Is there a reason my grid search overlooked the "superior" defaults?

asked Apr 20 '17 by Nicholas Hassan

People also ask

Which is better grid search or randomized search?

Random search is a technique where random combinations of the hyperparameters are tried to find a good configuration for the model. It is similar to grid search, but because it samples the search space rather than enumerating it exhaustively, it often reaches comparable or better results at a lower computational cost. Its drawback is that results can vary more from run to run (higher variance).

What is the difference between grid search and cross-validation?

Cross-validation is a method for robustly estimating test-set performance (generalization) of a model. Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
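To make the distinction concrete, here is a minimal sketch (using the iris dataset purely for illustration, and an arbitrary small grid): cross_val_score estimates how well one fixed model generalizes, while GridSearchCV selects the best configuration from a grid, using cross-validation internally.

from sklearn import datasets
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()

# Cross-validation: estimate generalization performance of one fixed model
scores = cross_val_score(gbc, iris.data, iris.target, cv=5)
print(scores.mean())

# Grid search: select the best of several candidate models
# (each point on the grid is evaluated with cross-validation internally)
search = GridSearchCV(gbc, {'max_depth': [2, 3, 5]}, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)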

Does grid search perform cross-validation?

Grid Search CV: the scikit-learn library comes with a grid-search cross-validation implementation. GridSearchCV tries every combination in the parameter grid for a model and returns the set of parameters with the best cross-validated performance score.

Which is better GridSearchCV or RandomizedSearchCV?

The main difference between the two approaches is that in grid search we define the combinations explicitly and train the model on every one of them, whereas RandomizedSearchCV samples the combinations at random. Both are effective ways of tuning hyperparameters to improve model generalizability.
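As a rough sketch (the parameter ranges here are arbitrary and chosen only for illustration), RandomizedSearchCV has essentially the same interface as GridSearchCV but evaluates a fixed number of randomly sampled candidates, controlled by n_iter, instead of trying every combination:

from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
param_distributions = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
                       'max_depth': [2, 3, 5, 10]}

# n_iter controls how many random parameter combinations are evaluated
rand_search = RandomizedSearchCV(GradientBoostingClassifier(),
                                 param_distributions,
                                 n_iter=10, random_state=0)
rand_search.fit(iris.data, iris.target)
print(rand_search.best_params_)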


1 Answer

Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you test on the same dataset. It looks like that's at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.

Assuming you're using the iris dataset (that's the dataset used in the example in your comment link), here's an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate':[0.01, 0.05, 0.1, 0.5, 1], 
              'min_samples_split':[2,5,10,20], 
              'max_depth':[2,3,5,10]}

clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}

Now repeat the grid search using a random training subset:

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(iris.data, iris.target, 
                                                 test_size=0.33, 
                                                 random_state=42)

clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)

print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}
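To close the loop (a short sketch reusing the variables from the snippet above), the held-out test set is then used only once, to score the model that the grid search selected:

# GridSearchCV refits the best estimator on X_train by default (refit=True),
# so it can be scored directly on the untouched holdout set
print(clf.score(X_test, y_test))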

I'm seeing much higher classification accuracy with both of these approaches, which makes me think maybe you're using different data - but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.

answered Oct 24 '22 by andrew_reece