Increasing n_jobs has no effect on GridSearchCV

I set up a simple experiment to check how much a multi-core CPU helps when running sklearn's GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I have misunderstood the benefit of multiple cores or have not set things up correctly.

There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU Performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I was expecting it to finish faster. Maybe not linearly faster (i.e. 8 jobs being 2 times faster than 4 jobs), but at least a bit faster.

This is how I set it up:

I am using jupyter-notebook, so "cell" below refers to a jupyter-notebook cell.

I loaded MNIST and used a test size of 0.05 to get 3000 digits in X_play.

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split

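# Note: fetch_mldata was removed in scikit-learn 0.22;
# fetch_openml('mnist_784', version=1) is the current way to load MNIST.
mnist = fetch_mldata('MNIST original')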

X, y = mnist["data"], mnist['target']

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)

In the next cell I set up KNN and the parameter grid for GridSearchCV:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

Then I made 8 cells, one for each n_jobs value from 1 to 8. My CPU is an i7-4770 with 4 cores and 8 threads.

# N_JOB_1_TO_8 is a placeholder: each of the 8 cells used one value from 1 to 8
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)


[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  2.0min finished
[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=3)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=4)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=5)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=6)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=7)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=8)]: Done  18 out of  18 | elapsed:  1.4min finished
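Instead of eight separate cells, the timings could be collected in one loop. The following is a minimal sketch, not what was actually run: `time_runs` and `fit_knn` are hypothetical helper names, and the commented usage assumes `knn_clf`, `param_grid`, `X_play` and `y_play` from the cells above.

```python
import time


def time_runs(fit_with_n_jobs, n_jobs_values):
    """Call fit_with_n_jobs(n) once per n and return {n: elapsed seconds}."""
    timings = {}
    for n in n_jobs_values:
        start = time.perf_counter()
        fit_with_n_jobs(n)
        timings[n] = time.perf_counter() - start
    return timings


# Usage with the objects defined in the earlier cells (left commented out
# because it needs scikit-learn and the MNIST subset):
#
# def fit_knn(n):
#     GridSearchCV(knn_clf, param_grid, cv=3, n_jobs=n).fit(X_play, y_play)
#
# for n, seconds in time_runs(fit_knn, range(1, 9)).items():
#     print(f"n_jobs={n}: {seconds / 60:.1f} min")
```

Measuring wall-clock time around the whole fit keeps the comparison independent of how the verbose log rounds its elapsed figure.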

Second test

With RandomForestClassifier the extra jobs were used much better. Test size was 0.5, giving 30000 images.

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
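
To make the comparison between the two tests concrete, the elapsed times reported above can be turned into speedup (time with 1 job divided by time with n jobs) and efficiency (speedup divided by n). This is a small sketch; the helper name `speedup_table` is made up for illustration, and the numbers are copied from the two logs.

```python
# Elapsed minutes copied from the two logs above, keyed by n_jobs.
knn_minutes = {1: 2.0, 2: 1.4, 3: 1.3, 4: 1.3, 5: 1.4, 6: 1.4, 7: 1.4, 8: 1.4}
rf_minutes = {1: 110.9, 2: 56.8, 3: 39.3, 4: 35.3, 5: 36.0, 6: 34.4, 7: 32.1, 8: 30.1}


def speedup_table(minutes):
    """Return {n_jobs: (speedup, efficiency)} relative to the n_jobs=1 run."""
    baseline = minutes[1]
    return {n: (baseline / t, baseline / t / n) for n, t in minutes.items()}


for name, minutes in [("KNN", knn_minutes), ("Random Forest", rf_minutes)]:
    print(name)
    for n, (speedup, efficiency) in speedup_table(minutes).items():
        print(f"  n_jobs={n}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

For the Random Forest run the speedup is close to 2x at 2 jobs and a little over 3x at 4 jobs, while the KNN run never gets beyond about 1.5x.
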
1 Answer

Here are some reasons that might cause this behaviour:

  • With an increasing number of threads, there is an overhead incurred for initializing and releasing each thread. I ran your code on my i7 7700HQ and saw the following behaviour with each increase in n_jobs:
    • when n_jobs=1 and n_jobs=2, the time per thread (the time GridSearchCV takes per model evaluation to fully train the model and test it) was 2.9s (overall time ~2 mins)
    • when n_jobs=3, time was 3.4s (overall time 1.4 mins)
    • when n_jobs=4, time was 3.8s (overall time 58 secs)
    • when n_jobs=5, time was 4.2s (overall time 51 secs)
    • when n_jobs=6, time was 4.2s (overall time ~49 secs)
    • when n_jobs=7, time was 4.2s (overall time ~49 secs)
    • when n_jobs=8, time was 4.2s (overall time ~49 secs)
  • As you can see, the time per thread increased but the overall time decreased (although beyond n_jobs=4 the difference was not exactly linear) and remained constant for n_jobs>=6. This is because there is a cost incurred in initializing and releasing threads. See this github issue and this issue.

  • Also, there might be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, thread pre-emption over RAM (or other resources), how the data is pushed into each thread, etc.

  • I suggest you read about Amdahl's Law, which states that there is a theoretical bound on the speedup that can be achieved through parallelization. It is given by the formula S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel workers. (Source: Amdahl's Law, Wikipedia)

  • Finally, it might be due to the data size and the complexity of the model you use for training as well.
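
Amdahl's Law from the list above can be tried out numerically. This is only an illustrative sketch: the function names are made up here, and the parallel fraction of 0.9 is an assumed value, not something measured from the code in the question.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup with n workers when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup, n):
    """Invert Amdahl's Law: the fraction p implied by an observed speedup on n > 1 workers."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


p = 0.9  # assumed parallel fraction, for illustration only
for n in (1, 2, 4, 8):
    print(f"n={n}: speedup {amdahl_speedup(p, n):.2f}x")

# The speedup can never exceed 1 / (1 - p), however many workers are added.
print(f"upper bound: {1.0 / (1.0 - p):.1f}x")
```

Feeding an observed speedup and its n_jobs value into `parallel_fraction` gives a rough idea of how much of a run behaves as if it were parallel, which is one way to compare the KNN and Random Forest results.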

Here is a blog post explaining the same issue regarding multithreading.

