
scikit-learn: general question about parallel computing

Tags:

scikit-learn

I would like to use sklearn.grid_search.GridSearchCV() on multiple processors in parallel. This is the first time I will do this, but my initial tests show that it seems to be working.

I am trying to understand this part of the documentation:

n_jobs : int, default 1

Number of jobs to run in parallel.

pre_dispatch : int, or string, optional

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.

An int, giving the exact number of total jobs that are spawned.

A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
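To make the three accepted forms concrete, here is a minimal sketch of how such a value could be turned into a number of jobs created up front. The helper name resolve_pre_dispatch and its n_tasks argument are hypothetical and used only for illustration; this is not scikit-learn's or joblib's actual code.

```python
def resolve_pre_dispatch(pre_dispatch, n_jobs, n_tasks):
    """Hypothetical helper: number of jobs created up front.

    pre_dispatch may be None, an int, or a string expression in n_jobs,
    mirroring the three forms described in the documentation excerpt.
    """
    if pre_dispatch is None:
        # No limit: every job is created immediately.
        return n_tasks
    if isinstance(pre_dispatch, str):
        # Evaluate the expression with only n_jobs visible to it.
        pre_dispatch = eval(pre_dispatch, {"__builtins__": {}}, {"n_jobs": n_jobs})
    # Never report more jobs than actually exist.
    return min(int(pre_dispatch), n_tasks)


print(resolve_pre_dispatch(None, n_jobs=2, n_tasks=1000))
print(resolve_pre_dispatch(4, n_jobs=2, n_tasks=1000))
print(resolve_pre_dispatch("2*n_jobs", n_jobs=2, n_tasks=1000))
```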

Can someone break this down for me? I'm having trouble understanding the difference between n_jobs and pre_dispatch. If I set n_jobs=5 and pre_dispatch=2, how is this different from just setting n_jobs=2?

Fequish asked Sep 19 '15 19:09



1 Answer

Suppose you are using GridSearchCV for KNN with the parameter grid k=[1, 2, 3, 4, 5, ..., 1000].

When nothing limits dispatching (pre_dispatch=None), GridSearchCV will first create 1000 jobs even if you set n_jobs=2. Each job holds one choice of k, and each may also carry its own copy of your data, which can exhaust memory if the data is big. Those 1000 jobs are then sent to 2 CPUs, so most of them will of course be pending.

GridSearchCV doesn't spawn just 2 jobs for 2 CPUs because spawning jobs on demand is expensive. Unless told otherwise, it spawns as many jobs as you have parameter combinations (1000 in this case).

In this sense, the name n_jobs might be misleading: it is the number of jobs that run in parallel, not the number of jobs that get created. With pre_dispatch you can set how many jobs are created and dispatched ahead of time.
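The difference between the two numbers can be sketched with the standard library alone. This is only an illustration of the idea, not how scikit-learn or joblib implement it: the function names run_with_pre_dispatch and score are made up, score is a stand-in for fitting a model, and a thread pool replaces the real worker processes. Here n_jobs caps how many tasks run at once, while pre_dispatch caps how many have been created and are waiting or running.

```python
import itertools
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def score(k):
    # Stand-in for fitting and scoring a model with parameter k.
    return k * k


def run_with_pre_dispatch(ks, n_jobs, pre_dispatch):
    """Run score(k) for every k, keeping at most pre_dispatch tasks created.

    Returns the results and the largest number of tasks that existed
    (queued or running) at the same time.
    """
    tasks = iter(ks)  # parameters are consumed lazily, one task at a time
    results = {}
    max_in_flight = 0
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Create only the first pre_dispatch tasks up front.
        pending = {
            pool.submit(score, k): k
            for k in itertools.islice(tasks, pre_dispatch)
        }
        while pending:
            max_in_flight = max(max_in_flight, len(pending))
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                k = pending.pop(future)
                results[k] = future.result()
                # Each finished task frees a slot for exactly one new task.
                for next_k in itertools.islice(tasks, 1):
                    pending[pool.submit(score, next_k)] = next_k
    return results, max_in_flight


results, peak = run_with_pre_dispatch(range(1, 1001), n_jobs=2, pre_dispatch=4)
print(len(results), peak)
```

With 1000 values of k, 2 workers and pre_dispatch=4, all 1000 results are computed, but no more than 4 tasks ever exist at the same moment, instead of all 1000 being created at the start.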

Michael answered Sep 29 '22 09:09