I'm optimizing some parameters for an SVC in sklearn, and the biggest issue here is having to wait 30 minutes before I try out any other parameter ranges. Worse, I'd like to try more values for C and gamma within the same range (so I can create a smoother surface plot), but I know that will just take longer and longer... When I ran it today I changed cache_size from 200 to 600 (without really knowing what it does) to see if it made a difference. The time decreased by about a minute.
Is there anything I can do to speed this up? Or am I just going to have to deal with a very long wait?
# Assumed imports, not shown in the original snippet:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

clf = svm.SVC(kernel="rbf", probability=True, cache_size=600)
gamma_range = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
c_range = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
param_grid = dict(gamma=gamma_range, C=c_range)
grid = GridSearchCV(clf, param_grid, cv=10, scoring="accuracy")
%time grid.fit(X_norm, y)
returns:
Wall time: 32min 59s
GridSearchCV(cv=10, error_score='raise',
       estimator=SVC(C=1.0, cache_size=600, class_weight=None, coef0=0.0,
                     degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
                     probability=True, random_state=None, shrinking=True,
                     tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0,
                         10000.0, 100000.0],
                   'gamma': [1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01,
                             0.1, 1.0, 10.0]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring='accuracy', verbose=0)
Only ~7.5k records were used for training with cv=3, and ~3k records for testing. Given those timings, for a parameter grid with 3,125 combinations, GridSearchCV took 10,856 seconds (~3 hours) whereas HalvingGridSearchCV took 465 seconds (~8 minutes), roughly 23x faster.
Grid search is exhaustive: every combination of hyperparameters is trained with cross-validation. If there are 100 candidate combinations and you are doing 5-fold cross-validation, the model will be trained 500 times. For heavy models this takes an excruciatingly long time.
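For reference, here is a minimal sketch of the halving approach applied to the question's estimator. It assumes scikit-learn >= 0.24, where HalvingGridSearchCV ships as an experimental feature that must be enabled explicitly; the thinned-out grid values are illustrative only.

from sklearn import svm
from sklearn.experimental import enable_halving_search_cv  # noqa: F401, enables the experimental API
from sklearn.model_selection import HalvingGridSearchCV

param_grid = dict(gamma=[1e-7, 1e-5, 1e-3, 1e-1, 1e1],
                  C=[1e-3, 1e-1, 1e1, 1e3, 1e5])
halving = HalvingGridSearchCV(
    svm.SVC(kernel="rbf", cache_size=600),
    param_grid,
    factor=3,           # each round keeps roughly the best 1/3 of candidates
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
# halving.fit(X_norm, y)  # X_norm, y as in the question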
The only difference between the two approaches is that in grid search we define the combinations ourselves and train the model on every one of them, whereas RandomizedSearchCV samples a fixed number of combinations at random. Both are effective ways of tuning hyperparameters that improve the model's generalizability.
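A minimal sketch of the randomized variant follows; the log-uniform sampling of C and gamma is my assumption, chosen to match the log-spaced grid above, and scipy.stats.loguniform requires scipy >= 1.4.

from scipy.stats import loguniform
from sklearn import svm
from sklearn.model_selection import RandomizedSearchCV

# Sample C and gamma from continuous log-uniform ranges instead of a fixed grid
param_distributions = dict(C=loguniform(1e-3, 1e5),
                           gamma=loguniform(1e-7, 1e1))
search = RandomizedSearchCV(
    svm.SVC(kernel="rbf", cache_size=600),
    param_distributions,
    n_iter=30,          # number of sampled candidates; this caps the total cost
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
    random_state=0,
)
# search.fit(X_norm, y)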
A few things:

- Decrease the number of folds (e.g. cv=3 in the GridSearchCV call) without any meaningful difference in performance estimation; going from cv=10 to cv=3 means 3 fits per candidate instead of 10.
- Set n_jobs to 2+ in your GridSearchCV call so you run multiple models at once. Depending on the size of your data, you may not be able to increase it too high, and you won't see an improvement increasing it past the number of cores you're running, but you can probably trim a bit of time that way.
- You could also set probability=False inside the SVC estimator to avoid applying expensive Platt calibration internally. (If the ability to run predict_proba is crucial, perform GridSearchCV with refit=False, and after picking the best parameter set in terms of the model's quality on the test set, just retrain the best estimator with probability=True on the whole training set; see the sketch after this list.)
- Another step would be to use RandomizedSearchCV instead of GridSearchCV, which would allow you to reach better model quality in roughly the same time (as controlled by the n_iter parameter).
- And, as already mentioned, use n_jobs=-1.
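Putting these tips together, here is a hedged sketch of a faster search (X_norm and y as in the question; the thinner grid is just an illustration, not a recommendation):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# probability=False skips the expensive Platt calibration during the search
clf = svm.SVC(kernel="rbf", probability=False, cache_size=600)
param_grid = dict(gamma=[1e-7, 1e-5, 1e-3, 1e-1, 1e1],
                  C=[1e-3, 1e-1, 1e1, 1e3, 1e5])
# cv=3 instead of 10, all cores, and no refit during the search
grid = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy",
                    n_jobs=-1, refit=False)
grid.fit(X_norm, y)

# Retrain the winning parameter set once with probability=True so
# predict_proba is available on the final model
best = svm.SVC(kernel="rbf", probability=True, cache_size=600,
               **grid.best_params_)
best.fit(X_norm, y)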