
Why does GridSearchCV in scikit-learn spawn so many threads?

Here is the pstree output of my currently running GridSearch. I am curious about which processes are running, and there is something I cannot explain yet.

 ├─bash─┬─perl───20*[bash───python─┬─5*[python───31*[{python}]]]
 │      │                          └─11*[{python}]]
 │      └─tee
 └─bash───pstree

I removed unrelated entries. Curly braces denote threads.

  • The appearance of perl is because I used parallel -j 20 to start my python jobs. As you can see, 20* indeed shows there are 20 processes.
  • The bash process before each of the python processes is due to the activation of an Anaconda virtual environment with source activate venv.
  • Inside each python process, there are another 5 python processes (5*) spawned. This is because I specified n_jobs=5 to GridSearchCV.

My understanding ends here.

Question: can anyone explain why there are another 11 python threads (11*[{python}]) alongside the grid search, and 31 python threads (31*[{python}]) spawned inside each of the 5 grid search jobs?

Update: added the code for calling GridSearchCV

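import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# C values log-spaced between 1e-2 and 1e2
Cs = 10 ** np.arange(-2, 2, 0.1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression()
gs = GridSearchCV(
    clf,
    param_grid={'C': Cs, 'penalty': ['l1'],
                'tol': [1e-10], 'solver': ['liblinear']},
    cv=skf,
    scoring='neg_log_loss',
    n_jobs=5,  # 5 worker processes, matching 5*[python] in pstree
    verbose=1,
    refit=True)
# Xs, ys: feature matrix and labels, defined elsewhere
gs.fit(Xs, ys)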

Update (2017-09-27):

I put the test code in a gist so you can easily reproduce this if interested.

I tested the same code on a Mac Pro and multiple linux machines, and reproduced @igrinis' result, but only on the Mac Pro. On the linux machines, I get different numbers than before, but they are consistent. So the number of threads spawned may depend on the particular data fed to GridSearchCV.

python─┬─5*[python───31*[{python}]]
       └─3*[{python}]
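
As a cross-check on the pstree numbers, here is a minimal sketch for counting the threads of a process from Python on Linux. It assumes a /proc filesystem is mounted and returns None where it is not (for example on macOS). The helper name count_threads is made up for this illustration. The count includes the main thread, so it should come out one higher than the N*[{python}] figure pstree prints for the same process.

```python
import os
import threading


def count_threads(pid="self"):
    """Count threads of a process via /proc (Linux only); None if unavailable."""
    task_dir = f"/proc/{pid}/task"
    if not os.path.isdir(task_dir):
        return None  # e.g. macOS, where /proc does not exist
    # Each entry in /proc/<pid>/task is one thread, including the main thread.
    return len(os.listdir(task_dir))


before = count_threads()

# Start one extra Python thread and keep it alive while we count again.
stop = threading.Event()
worker = threading.Thread(target=stop.wait)
worker.start()
after = count_threads()
stop.set()
worker.join()

print("threads before:", before, "after:", after)
```

Passing the PID of one of the GridSearchCV worker processes instead of "self" gives the count for that worker, provided you have permission to read its /proc entry.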

Note that the pstree versions installed by homebrew/linuxbrew on the Mac Pro and on the linux machines are different. Here are the exact versions I used:

Mac:

pstree $Revision: 2.39 $ by Fred Hucht (C) 1993-2015
EMail: fred AT thp.uni-due.de

Linux:

pstree (PSmisc) 22.20
Copyright (C) 1993-2009 Werner Almesberger and Craig Small

The Mac version doesn't seem to have an option to show threads, which I thought could be why they are not seen in the result. I haven't found a way to inspect threads on Mac Pro easily yet. If you happen to know a way, please comment.

Update (2017-10-12):

In another set of experiments, I confirmed that setting the environment variable OMP_NUM_THREADS makes a difference.

Before export OMP_NUM_THREADS=1, many threads (63 in this case) of unclear purpose are spawned, as described above:

bash───python─┬─23*[python───63*[{python}]]
              └─3*[{python}]

GNU parallel was not used here, and n_jobs=23.

After export OMP_NUM_THREADS=1, the worker processes no longer spawn any threads, but the 3 Python threads of the parent process are still there, and I still don't know what they are for.

bash───python─┬─23*[python]
              └─3*[{python}]
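
The same setting can be applied from inside the script instead of the shell. The sketch below is an illustration, not part of the original test code. Only OMP_NUM_THREADS was confirmed to matter in the experiment above. MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are added on the assumption that NumPy may be linked against MKL or OpenBLAS, and limit_native_threads is a made-up helper name. The variables have to be set before the numeric libraries are imported, since the thread pools are typically sized when those libraries load.

```python
import os

# OMP_NUM_THREADS is the variable tested in the post. The other two are
# assumptions that only matter if NumPy/SciPy use MKL or OpenBLAS.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_native_threads(n=1):
    """Set the thread-count environment variables for this process."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)


limit_native_threads(1)

# Import numpy / sklearn only after this point, so the limits are already
# in place when the native libraries initialise. Worker processes started
# later inherit the environment of this process.
for name in THREAD_ENV_VARS:
    print(name, "=", os.environ[name])
```

If the extra threads do come from an OpenMP or BLAS thread pool sized to the number of cores (a plausible but unverified reading of the 31 and 63 figures), this should bring each worker down to a single thread.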

I initially came across OMP_NUM_THREADS because it caused errors in some of my GridSearchCV jobs. The error messages look something like this:

OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
Asked by zyxue on Sep 21 '17 at 18:09


1 Answer

From the sklearn GridSearchCV documentation:

n_jobs : int, default=1 Number of jobs to run in parallel.

pre_dispatch : int, or string, optional Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

If I understand the documentation properly, GridSearchCV spawns as many threads as there are grid points, and only runs n_jobs of them simultaneously. I believe the number 31 is some kind of cap on your 40 possible values. Try playing with the value of the pre_dispatch parameter.
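
To make the numbers concrete, here is a small arithmetic sketch of how many tasks this search involves and how many would be queued at once under the pre_dispatch rules quoted above. The helper resolve_pre_dispatch is hypothetical and only mirrors the three documented cases (None, an int, a string such as '2*n_jobs'); it is not joblib's actual implementation. The figure of 40 candidates is taken from the 40 values of C mentioned above. Note that these are counts of tasks, not of threads.

```python
def resolve_pre_dispatch(pre_dispatch, n_jobs, n_tasks):
    """Hypothetical helper: how many tasks get queued up front."""
    if pre_dispatch is None:
        return n_tasks  # everything is dispatched immediately
    if isinstance(pre_dispatch, int):
        return min(pre_dispatch, n_tasks)
    # Strings such as "2*n_jobs" or "n_jobs": substitute and multiply.
    expr = pre_dispatch.replace("n_jobs", str(n_jobs))
    result = 1
    for factor in expr.split("*"):
        result *= int(factor)
    return min(result, n_tasks)


n_candidates = 40   # values of C in the grid (other parameters are fixed)
n_splits = 10       # StratifiedKFold(n_splits=10)
n_jobs = 5

n_tasks = n_candidates * n_splits
print("cross-validation fits:", n_tasks)                                   # 400
print("queued with '2*n_jobs':", resolve_pre_dispatch("2*n_jobs", n_jobs, n_tasks))  # 10
print("queued with None:", resolve_pre_dispatch(None, n_jobs, n_tasks))    # 400
```

Neither 10 nor 400 matches the 31 threads seen per worker, so under this reading the thread count is unlikely to be set by pre_dispatch alone.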

I believe the other 11 threads have nothing to do with GridSearchCV itself, since they are shown at the same level. I think they are leftovers of other commands.

By the way, I don't observe such behavior on Mac (I only see the 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.

Here is my pstree output (part of the path deleted for privacy):

 └─┬= 00396 *** -fish
   └─┬= 21743 *** python /Users/***/scratch_5.py
     ├─── 21775 *** python /Users/***/scratch_5.py
     ├─── 21776 *** python /Users/***/scratch_5.py
     ├─── 21777 *** python /Users/***/scratch_5.py
     ├─── 21778 *** python /Users/***/scratch_5.py
     └─── 21779 *** python /Users/***/scratch_5.py

Answer to the second comment:

That's your code, actually. I just generated a separable 1D two-class problem:

import numpy as np

N = 50000
# Class 0 lies in [0, 1) and class 1 in [3, 4), so the two are separable in 1D
Xs = np.concatenate((np.random.random(N), 3 + np.random.random(N))).reshape(-1, 1)
ys = np.concatenate((np.zeros(N), np.ones(N)))

100k samples were enough to keep the CPU busy for about a minute.

Answered by igrinis on Nov 15 '22 at 21:11