I set up a simple experiment to check how much a multi-core CPU helps when running sklearn GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I misunderstood the benefits of multiple cores or simply didn't set things up correctly.
There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I expected the later runs to finish faster. Maybe not linearly faster (i.e., 8 jobs being 2 times faster than 4 jobs), but at least somewhat faster.
This is how I set it up:
I am using jupyter-notebook, so "cell" below refers to a jupyter-notebook cell.
I loaded MNIST and used a 0.05 test size, which gives 3000 digits in X_play:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# fetch_mldata was removed from scikit-learn; fetch_openml is the current loader
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)
In the next cell I set up the KNN classifier and a GridSearchCV:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
Then I ran 8 cells, one for each n_jobs value. My CPU is an i7-4770 with 4 cores / 8 threads.
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)
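Instead of eight separate cells, the same measurement can be collected with one loop. This is a hypothetical sketch, not my original cells: it uses sklearn's built-in load_digits as a small stand-in dataset so it runs anywhere; swap in X_play/y_play to reproduce the real test.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for the MNIST subset; replace with X_play, y_play for the real run
X, y = load_digits(return_X_y=True)

param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

timings = {}
for n_jobs in (1, 2, 4, 8):
    gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, n_jobs=n_jobs)
    start = time.perf_counter()
    gs.fit(X, y)
    timings[n_jobs] = time.perf_counter() - start

for n_jobs, t in timings.items():
    print(f"n_jobs={n_jobs}: {t:.2f}s")
```

On a dataset this small the overhead of spawning workers can dominate, so the timings may not improve with n_jobs at all.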
Results
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 2.0min finished
[Parallel(n_jobs=2)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=3)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=4)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=5)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=6)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=7)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=8)]: Done 18 out of 18 | elapsed: 1.4min finished
Second test
With a RandomForestClassifier, CPU usage was much better. Test size was 0.5, i.e. 30000 images.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
According to the official scikit-learn documentation, for KNeighborsClassifier the n_jobs parameter is "the number of parallel jobs to run for neighbors search."
With n_jobs=1 it uses 100% of one core; each additional process runs on a different core. More generally, n_jobs is an integer specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used.
cv is the number of cross-validation folds tried for each selected set of hyperparameters. verbose can be set to 1 (or higher) to get a detailed printout while fitting the data with GridSearchCV.
As an aside on search strategy: in one reported benchmark, for a parameter grid with 3125 combinations, GridSearchCV took 10856 seconds (~3 hrs) whereas HalvingGridSearchCV took 465 seconds (~8 mins), approximately 23x faster.
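A minimal sketch of successive halving (HalvingGridSearchCV is still experimental in scikit-learn, hence the enable_* import). The dataset and grid here are illustrative stand-ins, not the benchmark above:

```python
# HalvingGridSearchCV must be enabled explicitly while it is experimental
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

param_grid = {'n_estimators': [20, 40, 60], 'max_features': ['sqrt', 'log2']}

# Successive halving: all candidates are evaluated on a small resource budget,
# then only the best 1/factor of them survive to the next, larger budget
search = HalvingGridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                             cv=3, factor=3, random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

Because poor candidates are discarded early on cheap fits, the total work grows much more slowly with grid size than exhaustive GridSearchCV.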
Here are some reasons that might cause this behaviour:
As n_jobs increases, the time per thread (the time GridSearchCV takes to fully train and test one model) goes up, even though the overall wall-clock time goes down:

- n_jobs=1 and n_jobs=2: time per thread was 2.9s (overall time ~2 mins)
- n_jobs=3: time was 3.4s (overall time 1.4 mins)
- n_jobs=4: time was 3.8s (overall time 58 secs)
- n_jobs=5: time was 4.2s (overall time 51 secs)
- n_jobs=6: time was 4.2s (overall time ~49 secs)
- n_jobs=7: time was 4.2s (overall time ~49 secs)
- n_jobs=8: time was 4.2s (overall time ~49 secs)

As you can see, time per thread increased but overall time decreased (although beyond n_jobs=4 the gain was not linear) and remained constant for n_jobs>=6. This is due to the cost incurred in initializing and releasing threads. See this GitHub issue and this issue.
There may also be other bottlenecks, such as the data being too large to broadcast to all threads at the same time, thread preemption over RAM (or other resources), how data is pushed into each thread, etc.
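The worker-overhead point can be seen directly with joblib, the library GridSearchCV uses under the hood. This is an illustrative snippet (not from the question): when each task is very cheap, dispatching to workers costs more than the work itself.

```python
import time
from joblib import Parallel, delayed

def tiny_task(x):
    # Deliberately trivial work, so dispatch/communication overhead dominates
    return x * x

start = time.perf_counter()
serial = [tiny_task(i) for i in range(2000)]
t_serial = time.perf_counter() - start

start = time.perf_counter()
parallel = Parallel(n_jobs=4)(delayed(tiny_task)(i) for i in range(2000))
t_parallel = time.perf_counter() - start

# Identical results; the parallel version is often *slower* for tasks this small
print(f"serial: {t_serial:.4f}s, parallel: {t_parallel:.4f}s")
```

Each ~3-second model fit in the KNN grid search is far from trivial, but with only 18 fits to distribute, fixed startup and data-transfer costs still eat into the speedup.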
I also suggest reading about Amdahl's Law, which states that there is a theoretical bound on the speedup achievable through parallelization: if a fraction p of the work can be parallelized over n workers, the overall speedup is S(n) = 1 / ((1 - p) + p / n). (Formula from the Amdahl's Law article on Wikipedia; the original answer showed it as an image.)
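The bound is easy to tabulate; the parallel fraction p = 0.9 below is an assumed value for illustration, not measured from the experiment:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup with parallel fraction p over n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Assumed p = 0.9: even a 90%-parallel workload tops out well below n-fold speedup
for n in (1, 2, 4, 8):
    print(f"n={n}: speedup {amdahl_speedup(0.9, n):.2f}x")
```

Even at 90% parallel work, 8 workers give less than 5x; combined with a 4-core/8-thread CPU, a plateau like the one in the timings is exactly what the law predicts.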
Finally, it might be due to the data size and the complexity of the model you use for training as well.
Here is a blog post explaining the same issue regarding multithreading.