 

Is there a quicker way of running GridSearchCV?

I'm optimizing some parameters for an SVC in sklearn, and the biggest issue here is having to wait 30 minutes before I try out any other parameter ranges. Worse is the fact that I'd like to try more values for C and gamma within the same range (so I can create a smoother surface plot), but I know that it will just take longer and longer... When I ran it today I changed cache_size from 200 to 600 (without really knowing what it does) to see if it made a difference. The time decreased by about a minute.

Is this something I can improve, or am I just going to have to deal with a very long runtime?

from sklearn import svm
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+

clf = svm.SVC(kernel="rbf", probability=True, cache_size=600)

gamma_range = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
c_range = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
param_grid = dict(gamma=gamma_range, C=c_range)

grid = GridSearchCV(clf, param_grid, cv=10, scoring="accuracy")
%time grid.fit(X_norm, y)

returns:

Wall time: 32min 59s

GridSearchCV(cv=10, error_score='raise',
   estimator=SVC(C=1.0, cache_size=600, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False),
   fit_params={}, iid=True, loss_func=None, n_jobs=1,
   param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0], 'gamma': [1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]},
   pre_dispatch='2*n_jobs', refit=True, score_func=None,
   scoring='accuracy', verbose=0)
asked Feb 26 '16 by bidby

People also ask

How long does it take to run GridSearchCV?

Only ~7.5k records were used for training with cv=3, and ~3k records for testing. Going by those timings, for a parameter grid with 3125 combinations, Grid Search CV took 10856 seconds (~3 hrs) whereas Halving Grid Search CV took 465 seconds (~8 mins), which is approximately 23x faster.
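Successive halving, referred to above, is available as `HalvingGridSearchCV` in scikit-learn 0.24+ (newer than the sklearn version in the question). A minimal sketch on synthetic stand-in data, assuming an RBF SVC like the one in the question:

```python
# HalvingGridSearchCV is still experimental; this import enables it
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in for the question's X_norm, y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {"C": [1e-1, 1e1, 1e3], "gamma": [1e-5, 1e-3, 1e-1]}

# Each round keeps the best 1/factor of the candidates and gives them
# more training samples, so bad candidates are discarded cheaply
halving = HalvingGridSearchCV(svm.SVC(kernel="rbf"), param_grid,
                              cv=3, factor=3, random_state=0)
halving.fit(X, y)
print(halving.best_params_)
```
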

Why does hyperparameter tuning take so long?

This means that all combinations of hyperparameters will be trained using cross-validation. If there are 100 possible candidates and you are doing 5-fold cross-validation, the given model will be trained 500 times (500 iterations). Surely, this will take an excruciatingly long time for heavy models.
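The same arithmetic applied to the question's setup shows where the 33 minutes go (a quick sanity check, not a benchmark):

```python
# Total model fits during cross-validation = candidates x CV folds
# The question's grid: 9 gamma values x 9 C values, with cv=10
n_gamma, n_c, cv = 9, 9, 10
total_fits = n_gamma * n_c * cv
print(total_fits)  # 810 separate SVC fits
```
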

Is RandomizedSearchCV better than GridSearchCV?

The only difference between the two approaches is that in grid search we define the combinations and train a model on each one, whereas RandomizedSearchCV selects the combinations randomly. Both are very effective ways of tuning parameters to increase model generalizability.


2 Answers

A few things:

  1. 10-fold CV is overkill and causes you to fit 10 models for each parameter group. You can get an instant 2-3x speedup by switching to 5- or 3-fold CV (i.e., cv=3 in the GridSearchCV call) without any meaningful difference in performance estimation.
  2. Try fewer parameter options at each round. With 9x9 combinations, you're trying 81 different combinations on each run. Typically, you'll find better performance at one end of the scale or the other, so maybe start with a coarse grid of 3-4 options, and then go finer as you start to identify the area that's more interesting for your data. 3x3 options means a 9x speedup vs. what you're doing now.
  3. You can get a trivial speedup by setting n_jobs to 2+ in your GridSearchCV call so you run multiple models at once. Depending on the size of your data, you may not be able to increase it too high, and you won't see an improvement increasing it past the number of cores you're running, but you can probably trim a bit of time that way.
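Putting the three points together, a sketch of what the faster search might look like (synthetic data stands in for the question's `X_norm`, `y`; the coarse 3x3 grid and `cv=3` are illustrative choices, not tuned values):

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the question's data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = svm.SVC(kernel="rbf", cache_size=600)

# Point 2: coarse 3x3 grid first (9 candidates instead of 81)
param_grid = {"C": [1e-1, 1e1, 1e3], "gamma": [1e-5, 1e-3, 1e-1]}

# Point 1: cv=3 instead of 10; point 3: n_jobs=-1 uses all cores
grid = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Once the coarse search localizes the interesting region, a second, finer grid around `grid.best_params_` refines it.
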
answered Oct 02 '22 by Randy


You could also set probability=False inside the SVC estimator to avoid applying the expensive Platt calibration internally. (If the ability to run predict_proba is crucial, run GridSearchCV with refit=False, and after picking the best parameter set in terms of model quality on the test set, just retrain the best estimator with probability=True on the whole training set.)
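A sketch of that two-step approach on synthetic stand-in data (with single-metric scoring, `best_params_` is still populated even when `refit=False`):

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}

# Search WITHOUT Platt calibration and without the final refit:
# every one of the cross-validated fits is cheaper this way
grid = GridSearchCV(svm.SVC(kernel="rbf", probability=False),
                    param_grid, cv=3, scoring="accuracy", refit=False)
grid.fit(X, y)

# Retrain only the winning configuration once, with probability=True
best = svm.SVC(kernel="rbf", probability=True, **grid.best_params_)
best.fit(X, y)
proba = best.predict_proba(X)  # calibrated probabilities, paid for once
```
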

Another step would be to use RandomizedSearchCV instead of GridSearchCV, which can reach comparable or better model quality in far less time (the budget is controlled by the n_iter parameter).

And, as already mentioned, use n_jobs=-1.
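Combining the three suggestions, a sketch with RandomizedSearchCV sampling C and gamma from log-uniform distributions instead of a fixed grid (synthetic data and the `n_iter=20` budget are illustrative assumptions):

```python
from scipy.stats import loguniform
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# probability=False skips Platt calibration during the search
clf = svm.SVC(kernel="rbf", probability=False, cache_size=600)

# Sample continuously over the question's ranges rather than 81 fixed points
param_dist = {"C": loguniform(1e-3, 1e5), "gamma": loguniform(1e-7, 1e1)}

search = RandomizedSearchCV(clf, param_dist, n_iter=20, cv=3,
                            scoring="accuracy", n_jobs=-1, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With `n_iter=20` this fits 20 * 3 = 60 models instead of the question's 810, and the budget stays fixed no matter how finely the ranges are explored.
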

answered Oct 02 '22 by Anatoly Alekseev