
How to estimate the progress of a GridSearchCV from verbose output in Scikit-Learn?

Right now I'm running a pretty aggressive grid search. I have n=135 samples and am running 23 folds using a custom cross-validation train/test list, with verbose=2.

The following is what I ran:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_test = {
    "loss": ["deviance"],
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 12),
    "min_samples_leaf": np.linspace(0.1, 0.5, 12),
    "min_impurity_split": [5e-6, 1e-7, 5e-7],
    "criterion": ["friedman_mse", "mae"],
    "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
}

# cv_indices is my custom list of (train, test) index pairs, defined elsewhere
Mod_gsearch = GridSearchCV(estimator=GradientBoostingClassifier(),
                           param_grid=param_test, scoring="accuracy",
                           n_jobs=32, iid=False, cv=cv_indices, verbose=2)

I took a look at the verbose output in stdout:

$ head gridsearch.o8475533
Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits

Based on this, it looks like there are 5842368 permutations of cross-validation pairs using my grid params.

$ grep -c "\[CV\]" gridsearch.o8475533
7047332

It looks like around 7 million cross-validations have been done so far, but that's more than the 5842368 total fits:

7047332/5842368 = 1.2062458236

Then when I look at the stderr file:

$ cat ./gridsearch.e8475533
[Parallel(n_jobs=32)]: Done 132 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 538 tasks      | elapsed:    2.8s
[Parallel(n_jobs=32)]: Done 1104 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 1834 tasks      | elapsed:    7.9s
[Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s
[Parallel(n_jobs=32)]: Done 3396203 tasks      | elapsed: 250.2min
[Parallel(n_jobs=32)]: Done 3420769 tasks      | elapsed: 276.5min
[Parallel(n_jobs=32)]: Done 3447309 tasks      | elapsed: 279.3min
[Parallel(n_jobs=32)]: Done 3484240 tasks      | elapsed: 282.3min
[Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min

My goal:

How can I know the progress of my gridsearch with respect to the total time it may take?

What I'm confused about:

What is the relationship between [CV] lines in stdout, total # of fits in stdout, and tasks in stderr?

O.rka asked Apr 13 '17 18:04



1 Answer

The math is simple, but a little misleading at first sight:

  1. When each task starts, the logging mechanism writes a '[CV] ...' line to stdout announcing the start of execution. When the task ends, it writes another '[CV] ...' line that also reports the time spent on that particular task (at the end of the line).

  2. Additionally, at certain time intervals, the logging mechanism writes a progress line to stderr (or to stdout if you set verbose above 50) reporting the number of completed tasks (fits) and the total time elapsed so far, like this one:

    [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
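If you want to read those progress lines programmatically, a minimal sketch could look like the following. It assumes the lines have exactly the shape shown in the stderr log above (a task count followed by an elapsed time in `s` or `min`); the helper name `parse_progress` is made up for this illustration and is not part of scikit-learn or joblib.

```python
import re

# Matches progress lines of the shape seen in the stderr log, e.g.
# "[Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s"
PROGRESS_RE = re.compile(
    r"\[Parallel\(n_jobs=-?\d+\)\]: Done\s+(\d+) tasks\s*\|\s*elapsed:\s*([\d.]+)(s|min)"
)

def parse_progress(line):
    """Return (tasks_done, elapsed_seconds), or None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    tasks_done = int(match.group(1))
    value, unit = float(match.group(2)), match.group(3)
    # Normalise the elapsed time to seconds (assumes only "s" and "min" occur).
    elapsed_seconds = value * 60.0 if unit == "min" else value
    return tasks_done, elapsed_seconds

print(parse_progress("[Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s"))
```

Running this over the last line of the stderr file gives the most recent completed-task count.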

For your case, you have 5842368 total fits, i.e. tasks.
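As a sanity check on that total, the number of fits is the number of parameter combinations multiplied by the number of folds. The sketch below assumes a single-dict `param_grid` whose values are all lists; `total_fits` and `toy_grid` are hypothetical names used only for illustration, and the toy grid is not the grid from the question.

```python
import math

def total_fits(param_grid, n_folds):
    """Candidates (product of the value-list lengths) times the number of folds."""
    n_candidates = math.prod(len(values) for values in param_grid.values())
    return n_candidates * n_folds

# Hypothetical toy grid: 2 * 3 = 6 candidates.
toy_grid = {"learning_rate": [0.01, 0.1], "subsample": [0.5, 0.8, 1.0]}
print(total_fits(toy_grid, n_folds=23))  # 6 candidates * 23 folds = 138

# The figures reported in the stdout header from the question are consistent:
print(254016 * 23 == 5842368)  # True
```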

You counted 7047332 '[CV] ...' lines, which corresponds to about 7047332/2 = 3523666 finished tasks, and the progress line shows exactly how many tasks are completed: 3523550. The two numbers differ slightly because some tasks had started but not yet finished at the time of counting.
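To turn these numbers into a rough progress estimate, you can divide the completed tasks by the total fits and extrapolate the remaining time. This is only a sketch: it assumes the average rate so far continues to hold, which may not be true here since fit time depends on the parameters being tried. The function name `estimate_progress` is invented for this example.

```python
def estimate_progress(tasks_done, total_tasks, elapsed_minutes):
    """Return (fraction_done, estimated_minutes_remaining) assuming a constant rate."""
    fraction_done = tasks_done / total_tasks
    tasks_per_minute = tasks_done / elapsed_minutes
    minutes_remaining = (total_tasks - tasks_done) / tasks_per_minute
    return fraction_done, minutes_remaining

# Numbers taken from the last stderr line and the stdout header in the question.
fraction, remaining = estimate_progress(3523550, 5842368, 285.3)
print(f"{fraction:.1%} done, roughly {remaining:.0f} min remaining")

# Each task writes two '[CV]' lines (start and end), so halve the grep count.
approx_finished = 7047332 // 2
print(approx_finished)  # 3523666
```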

vladkha answered Sep 17 '22 04:09
