
Cross validation with grid search returns worse results than default

I'm using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the "best" parameters for different techniques, yet many of these perform worse than the defaults. I include the default parameters as an option, so I'm surprised this would happen.

For example:

from sklearn import svm, grid_search
from sklearn.ensemble import GradientBoostingClassifier
from time import time  # needed for the timing calls below
gbc = GradientBoostingClassifier(verbose=1)
parameters = {'learning_rate':[0.01, 0.05, 0.1, 0.5, 1],  
              'min_samples_split':[2,5,10,20], 
              'max_depth':[2,3,5,10]}
clf = grid_search.GridSearchCV(gbc, parameters)
t0 = time()
clf.fit(X_crossval, labels)
print "Gridsearch time:", round(time() - t0, 3), "s"
print clf.best_params_
# The output is: {'min_samples_split': 2, 'learning_rate': 0.01, 'max_depth': 2}

This is nearly the same as the defaults (max_depth defaults to 3 and learning_rate to 0.1). When I use these parameters, I get an accuracy of 72%, compared to 78% from the defaults.

One thing I did, that I will admit is suspicious, is that I used my entire dataset for the cross validation. Then after obtaining the parameters, I ran it using the same dataset, split into 75-25 training/testing.

Is there a reason my grid search overlooked the "superior" defaults?

asked Apr 20 '17 by Nicholas Hassan

People also ask

Which is better grid search or randomized search?

Random search is a technique where random combinations of the hyperparameters are tried to find a good configuration for the model. It is similar to grid search, but because it samples the search space rather than enumerating it exhaustively, it often reaches comparable or better results at a lower computational cost. Its drawback is that results can vary more from run to run (higher variance).

What is the difference between grid search and cross-validation?

Cross-validation is a method for robustly estimating test-set performance (generalization) of a model. Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
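To make the distinction concrete, here is a minimal sketch (using the iris dataset purely for illustration, and an arbitrary small grid): cross_val_score estimates how well one fixed model generalizes, while GridSearchCV selects the best configuration from a grid, using cross-validation internally.

from sklearn import datasets
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()

# Cross-validation: estimate generalization performance of one fixed model
scores = cross_val_score(gbc, iris.data, iris.target, cv=5)
print(scores.mean())

# Grid search: select the best of several candidate models
# (each point on the grid is evaluated with cross-validation internally)
search = GridSearchCV(gbc, {'max_depth': [2, 3, 5]}, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)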

Does grid search perform cross-validation?

Grid Search CV: the scikit-learn library comes with a grid-search cross-validation implementation. GridSearchCV tries every combination in the parameter grid for a model and returns the set of parameters with the best cross-validated performance score.

Which is better GridSearchCV or RandomizedSearchCV?

The main difference between the two approaches is that in grid search we define the combinations explicitly and train the model on every one of them, whereas RandomizedSearchCV samples the combinations at random. Both are effective ways of tuning hyperparameters to improve model generalizability.
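As a rough sketch (the parameter ranges here are arbitrary and chosen only for illustration), RandomizedSearchCV has essentially the same interface as GridSearchCV but evaluates a fixed number of randomly sampled candidates, controlled by n_iter, instead of trying every combination:

from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
param_distributions = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
                       'max_depth': [2, 3, 5, 10]}

# n_iter controls how many random parameter combinations are evaluated
rand_search = RandomizedSearchCV(GradientBoostingClassifier(),
                                 param_distributions,
                                 n_iter=10, random_state=0)
rand_search.fit(iris.data, iris.target)
print(rand_search.best_params_)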


1 Answer

Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you test on the same dataset. It looks like that's at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.

Assuming you're using the iris dataset (that's the dataset used in the example in your comment link), here's an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate':[0.01, 0.05, 0.1, 0.5, 1], 
              'min_samples_split':[2,5,10,20], 
              'max_depth':[2,3,5,10]}

clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}

Now repeat the grid search using a random training subset:

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(iris.data, iris.target, 
                                                 test_size=0.33, 
                                                 random_state=42)

clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)

print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}
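To close the loop (a short sketch reusing the variables from the snippet above), the held-out test set is then used only once, to score the model that the grid search selected:

# GridSearchCV refits the best estimator on X_train by default (refit=True),
# so it can be scored directly on the untouched holdout set
print(clf.score(X_test, y_test))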

I'm seeing much higher classification accuracy with both of these approaches, which makes me think maybe you're using different data - but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.

answered Oct 24 '22 by andrew_reece