
How to estimate the progress of a GridSearchCV from verbose output in Scikit-Learn?

Tags: python, parameters, machine-learning, scikit-learn, grid-search

Right now I'm running a pretty aggressive grid search. I have n=135 samples and I am running 23 folds using a custom cross-validation train/test list. I have set verbose=2.

The following is what I ran:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_test = {"loss": ["deviance"],
              "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
              "min_samples_split": np.linspace(0.1, 0.5, 12),
              "min_samples_leaf": np.linspace(0.1, 0.5, 12),
              "max_depth": [3, 5, 8],
              "max_features": ["log2", "sqrt"],
              "min_impurity_split": [5e-6, 1e-7, 5e-7],
              "criterion": ["friedman_mse", "mae"],
              "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
              "n_estimators": [10]}

# cv_indices is the custom list of (train, test) index pairs (23 folds)
Mod_gsearch = GridSearchCV(estimator=GradientBoostingClassifier(),
                           param_grid=param_test, scoring="accuracy",
                           n_jobs=32, iid=False, cv=cv_indices, verbose=2)

I took a look at the verbose output in stdout:

$ head gridsearch.o8475533
Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits

Based on this, it looks like there are 5842368 permutations of cross-validation pairs using my grid params.

$ grep -c  "[CV]" gridsearch.o8475533
7047332

It looks like there are around 7 million cross-validations that have been done so far but that's more than the 5842368 total fits...

7047332/5842368 = 1.2062458236

Then when I look at the stderr file:

$ cat ./gridsearch.e8475533
[Parallel(n_jobs=32)]: Done 132 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 538 tasks      | elapsed:    2.8s
[Parallel(n_jobs=32)]: Done 1104 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 1834 tasks      | elapsed:    7.9s
[Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s
...
[Parallel(n_jobs=32)]: Done 3396203 tasks      | elapsed: 250.2min
[Parallel(n_jobs=32)]: Done 3420769 tasks      | elapsed: 276.5min
[Parallel(n_jobs=32)]: Done 3447309 tasks      | elapsed: 279.3min
[Parallel(n_jobs=32)]: Done 3484240 tasks      | elapsed: 282.3min
[Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min

My goal:

How can I know the progress of my gridsearch with respect to the total time it may take?

What I'm confused about:

What is the relationship between [CV] lines in stdout, total # of fits in stdout, and tasks in stderr?


asked Apr 13 '17 18:04

O.rka

1 Answer

The math is simple, but a little misleading at first sight:

1. When each task starts, the logger writes a '[CV] ...' line to stdout announcing the start of execution. When the task ends, it writes another '[CV] ...' line that also includes the time spent on that particular task (at the end of the line).
2. In addition, at intervals, the logger writes a progress line to stderr (or to stdout if you set verbose to more than 50) showing how many of the total tasks (fits) have completed and the total time elapsed so far, like this one:

[Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s

For your case, you have 5842368 total fits, i.e. tasks.
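To see where that total comes from, here is a minimal sketch that multiplies the number of values listed for each parameter in the question's param_test and then multiplies by the 23 folds. The values_per_param dictionary is my own shorthand for the grid, not something scikit-learn provides.

```python
import math

# Number of values listed for each parameter in the question's param_test.
values_per_param = {
    "loss": 1,
    "learning_rate": 7,
    "min_samples_split": 12,   # np.linspace(0.1, 0.5, 12)
    "min_samples_leaf": 12,    # np.linspace(0.1, 0.5, 12)
    "max_depth": 3,
    "max_features": 2,
    "min_impurity_split": 3,
    "criterion": 2,
    "subsample": 7,
    "n_estimators": 1,
}
n_folds = 23  # length of the custom cv_indices list

# Every combination of parameter values is one candidate.
n_candidates = math.prod(values_per_param.values())
# Each candidate is fitted once per fold; each fit is one task.
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Both numbers should agree with the "Fitting 23 folds for each of 254016 candidates" line in stdout.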

You counted 7047332 '[CV] ...' lines, which corresponds to about 7047332/2 = 3523666 finished tasks, while the progress line shows exactly how many tasks have completed: 3523550. The halved count is only approximate because some tasks had started but not yet finished when you counted.
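For the original goal of estimating progress, a rough calculation could look like the sketch below. It parses a progress line in the format shown in the question's stderr and compares the completed count with the total number of fits. The function names are my own, and the remaining-time figure assumes fits keep completing at the average rate seen so far, which may not hold because different parameter combinations can take different amounts of time.

```python
import re

# Matches the progress-line format shown in the question's stderr, e.g.
# "[Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min"
PROGRESS_RE = re.compile(
    r"Done\s+(\d+)\s+tasks\s*\|\s*elapsed:\s*([\d.]+)(s|min)"
)


def parse_progress(line):
    """Return (tasks_done, elapsed_minutes) from one progress line."""
    match = PROGRESS_RE.search(line)
    if match is None:
        raise ValueError("not a progress line: %r" % line)
    done = int(match.group(1))
    elapsed = float(match.group(2))
    if match.group(3) == "s":
        elapsed /= 60.0  # convert seconds to minutes
    return done, elapsed


def estimate(done, elapsed_min, total_fits):
    """Return (fraction_done, remaining_minutes), assuming a constant rate."""
    fraction = done / total_fits
    remaining = elapsed_min * (total_fits - done) / done
    return fraction, remaining


total_fits = 5842368  # from the "totalling ... fits" line in stdout
last_line = "[Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min"

done, elapsed_min = parse_progress(last_line)
fraction, remaining_min = estimate(done, elapsed_min, total_fits)
print("%.1f%% done, roughly %.0f min remaining" % (100 * fraction, remaining_min))

# Cross-check against stdout: each task writes two '[CV]' lines
# (one at start, one at end), so halving the count approximates finished tasks.
cv_lines = 7047332
approx_finished = cv_lines // 2
print(approx_finished, "vs", done)
```

The last progress line in stderr is the more reliable source for the completed count. The halved '[CV]' count is only a sanity check, since it also includes tasks that have started but not yet finished.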

answered Sep 17 '22 04:09

vladkha
