I set up a simple experiment to check how much a multi-core CPU helps when running sklearn GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I misunderstood the benefits of multiple cores or simply didn't set things up correctly.
There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I expected the later runs to finish faster. Maybe not linearly faster (i.e., 8 jobs being 2 times faster than 4 jobs), but at least somewhat faster.
This is how I set it up:
I am using jupyter-notebook, so "cell" below refers to a jupyter-notebook cell.
I loaded MNIST and used a 0.05 test size, which gives 3000 digits in X_play:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# fetch_mldata was removed from scikit-learn; fetch_openml is the current loader
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)
In the next cell I set up the KNN classifier and a GridSearchCV:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
Then I ran 8 cells, one for each n_jobs value. My CPU is an i7-4770 with 4 cores / 8 threads.
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)
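Instead of eight separate cells, the same measurement can be collected with one loop. This is a hypothetical sketch, not my original cells: it uses sklearn's built-in load_digits as a small stand-in dataset so it runs anywhere; swap in X_play/y_play to reproduce the real test.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for the MNIST subset; replace with X_play, y_play for the real run
X, y = load_digits(return_X_y=True)

param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

timings = {}
for n_jobs in (1, 2, 4, 8):
    gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, n_jobs=n_jobs)
    start = time.perf_counter()
    gs.fit(X, y)
    timings[n_jobs] = time.perf_counter() - start

for n_jobs, t in timings.items():
    print(f"n_jobs={n_jobs}: {t:.2f}s")
```

On a dataset this small the overhead of spawning workers can dominate, so the timings may not improve with n_jobs at all.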
Results
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 2.0min finished
[Parallel(n_jobs=2)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=3)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=4)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=5)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=6)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=7)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=8)]: Done 18 out of 18 | elapsed: 1.4min finished
Second test
With a RandomForestClassifier, CPU usage was much better. Test size was 0.5, i.e. 30000 images.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
According to the official scikit-learn documentation, for KNeighborsClassifier the n_jobs parameter is "the number of parallel jobs to run for neighbors search."
With n_jobs=1 it uses 100% of one core; each additional process runs on a different core. More generally, n_jobs is an integer specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used.
cv is the number of cross-validation folds tried for each selected set of hyperparameters. verbose can be set to 1 (or higher) to get a detailed printout while fitting the data with GridSearchCV.
As an aside on search strategy: in one reported benchmark, for a parameter grid with 3125 combinations, GridSearchCV took 10856 seconds (~3 hrs) whereas HalvingGridSearchCV took 465 seconds (~8 mins), approximately 23x faster.
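A minimal sketch of successive halving (HalvingGridSearchCV is still experimental in scikit-learn, hence the enable_* import). The dataset and grid here are illustrative stand-ins, not the benchmark above:

```python
# HalvingGridSearchCV must be enabled explicitly while it is experimental
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

param_grid = {'n_estimators': [20, 40, 60], 'max_features': ['sqrt', 'log2']}

# Successive halving: all candidates are evaluated on a small resource budget,
# then only the best 1/factor of them survive to the next, larger budget
search = HalvingGridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                             cv=3, factor=3, random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

Because poor candidates are discarded early on cheap fits, the total work grows much more slowly with grid size than exhaustive GridSearchCV.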
Here are some reasons that might cause this behaviour:
As n_jobs increases, the time per thread (the time GridSearchCV takes to fully train and test one model) goes up, even though the overall wall-clock time goes down:

- n_jobs=1 and n_jobs=2: time per thread was 2.9s (overall time ~2 mins)
- n_jobs=3: time was 3.4s (overall time 1.4 mins)
- n_jobs=4: time was 3.8s (overall time 58 secs)
- n_jobs=5: time was 4.2s (overall time 51 secs)
- n_jobs=6: time was 4.2s (overall time ~49 secs)
- n_jobs=7: time was 4.2s (overall time ~49 secs)
- n_jobs=8: time was 4.2s (overall time ~49 secs)

As you can see, time per thread increased but overall time decreased (although beyond n_jobs=4 the gain was not linear) and remained constant for n_jobs>=6. This is due to the cost incurred in initializing and releasing threads. See this GitHub issue and this issue.
There may also be other bottlenecks, such as the data being too large to broadcast to all threads at the same time, thread preemption over RAM (or other resources), how data is pushed into each thread, etc.
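The worker-overhead point can be seen directly with joblib, the library GridSearchCV uses under the hood. This is an illustrative snippet (not from the question): when each task is very cheap, dispatching to workers costs more than the work itself.

```python
import time
from joblib import Parallel, delayed

def tiny_task(x):
    # Deliberately trivial work, so dispatch/communication overhead dominates
    return x * x

start = time.perf_counter()
serial = [tiny_task(i) for i in range(2000)]
t_serial = time.perf_counter() - start

start = time.perf_counter()
parallel = Parallel(n_jobs=4)(delayed(tiny_task)(i) for i in range(2000))
t_parallel = time.perf_counter() - start

# Identical results; the parallel version is often *slower* for tasks this small
print(f"serial: {t_serial:.4f}s, parallel: {t_parallel:.4f}s")
```

Each ~3-second model fit in the KNN grid search is far from trivial, but with only 18 fits to distribute, fixed startup and data-transfer costs still eat into the speedup.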
I also suggest reading about Amdahl's Law, which states that there is a theoretical bound on the speedup achievable through parallelization: if a fraction p of the work can be parallelized over n workers, the overall speedup is S(n) = 1 / ((1 - p) + p / n). (Formula from the Amdahl's Law article on Wikipedia; the original answer showed it as an image.)
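The bound is easy to tabulate; the parallel fraction p = 0.9 below is an assumed value for illustration, not measured from the experiment:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup with parallel fraction p over n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Assumed p = 0.9: even a 90%-parallel workload tops out well below n-fold speedup
for n in (1, 2, 4, 8):
    print(f"n={n}: speedup {amdahl_speedup(0.9, n):.2f}x")
```

Even at 90% parallel work, 8 workers give less than 5x; combined with a 4-core/8-thread CPU, a plateau like the one in the timings is exactly what the law predicts.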
Finally, it might be due to the data size and the complexity of the model you use for training as well.
Here is a blog post explaining the same issue regarding multithreading.