Here is the pstree output of my currently running GridSearch. I am curious to see what processes are going on, and there is something I cannot explain yet.
├─bash─┬─perl───20*[bash───python─┬─5*[python───31*[{python}]]]
│ │ └─11*[{python}]]
│ └─tee
└─bash───pstree
I removed unrelated processes. Curly braces denote threads.
I used parallel -j 20 to start my Python jobs, and as you can see, 20* indeed shows there are 20 processes. The bash process before each of the python processes is due to activating the Anaconda virtual environment with source activate venv.
Inside each python process, another 5 python processes (5*) are spawned. This is because I specified n_jobs=5 to GridSearchCV.
My understanding ends here.
Question: can anyone explain why there are another 11 python threads (11*[{python}]) alongside the grid search, and 31 python threads (31*[{python}]) spawned inside each of the 5 grid search jobs?
Update: added the code for calling GridSearchCV
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Cs = 10 ** np.arange(-2, 2, 0.1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression()
gs = GridSearchCV(
clf,
param_grid={'C': Cs, 'penalty': ['l1'],
'tol': [1e-10], 'solver': ['liblinear']},
cv=skf,
scoring='neg_log_loss',
n_jobs=5,
verbose=1,
refit=True)
gs.fit(Xs, ys)
Update (2017-09-27):
I wrapped up a test code on gist for you to easily reproduce if interested.
I tested the same code on a Mac Pro and multiple linux machines, and reproduced @igrinis' result, but only on the Mac Pro. On the linux machines I get different numbers than before, but consistently. So the number of threads spawned may depend on the particular data fed to GridSearchCV.
python─┬─5*[python───31*[{python}]]
└─3*[{python}]
Note that the pstree versions installed by homebrew/linuxbrew on the Mac Pro and the linux machines are different. Here are the exact versions I used:
Mac:
pstree $Revision: 2.39 $ by Fred Hucht (C) 1993-2015
EMail: fred AT thp.uni-due.de
Linux:
pstree (PSmisc) 22.20
Copyright (C) 1993-2009 Werner Almesberger and Craig Small
The Mac version doesn't seem to have an option to show threads, which could be why they do not appear in the output. I haven't found an easy way to inspect threads on the Mac Pro yet. If you happen to know a way, please comment.
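For reference, on Linux you can count a process's native threads without pstree by reading /proc/&lt;pid&gt;/status; macOS has no /proc, but ps -M -p &lt;pid&gt; prints one row per thread. A minimal Linux-only sketch:

```python
import os

def thread_count(pid: int) -> int:
    """Return the native thread count of a process on Linux via /proc."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError(f"no Threads: line for pid {pid}")

# A freshly started interpreter typically reports a small number here.
print(thread_count(os.getpid()))
```

This counts all OS-level threads, including OpenMP pool threads that never show up as Python threading objects.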
Update (2017-10-12)
In another set of experiment, I confirmed that setting the environment variable OMP_NUM_THREADS
makes a difference.
Before export OMP_NUM_THREADS=1, there are many (63 in this case) threads of unclear use spawned, as described above:
bash───python─┬─23*[python───63*[{python}]]
└─3*[{python}]
GNU parallel is not used here; n_jobs=23.
After export OMP_NUM_THREADS=1, no extra threads are spawned, but the 3 Python processes are still there, and I am still unaware of their purpose.
bash───python─┬─23*[python]
└─3*[{python}]
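The variable can also be set from inside Python instead of the shell, as long as it happens before numpy/scipy/sklearn are imported, since the OpenMP runtime (MKL/OpenBLAS) reads it when its thread pool is created. A minimal sketch:

```python
import os

# Must be set before numpy/scipy/sklearn are imported: the OpenMP runtime
# reads this variable when it initializes its thread pool.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # imported only after the cap is in place
```

This is convenient when you cannot control the environment the script is launched from.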
I initially came across OMP_NUM_THREADS because it causes errors in some of my GridSearchCV jobs; the error messages look like this:
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
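System error #11 is EAGAIN, which on thread creation usually means a resource limit was hit. One thing worth checking (a guess, not a confirmed diagnosis of this exact failure) is the per-user process limit, which on Linux counts threads as well:

```shell
# RLIMIT_NPROC counts threads as well as processes; if the grid search
# workers plus their OpenMP pools exceed it, pthread_create fails with EAGAIN.
ulimit -u
```

If the printed number is small relative to n_jobs times the OpenMP pool size, lowering OMP_NUM_THREADS (as the hint suggests) or raising the limit should help.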
From sklearn.GridSearchCV
doc:
n_jobs : int, default=1 Number of jobs to run in parallel.
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
An int, giving the exact number of total jobs that are spawned
A string, giving an expression as a function of n_jobs, as in '2*n_jobs'
If I understand the documentation properly, GridSearchCV spawns as many jobs as there are grid points, but only runs n_jobs of them simultaneously. The number 31, I believe, is some kind of cap related to your 40 possible values. Try playing with the value of the pre_dispatch parameter.
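To illustrate the suggestion, here is a small self-contained sketch of passing pre_dispatch explicitly (the dataset and parameter values are made up for the example; '2*n_jobs' happens to be sklearn's default):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny synthetic problem, just to exercise the parameter.
X, y = make_classification(n_samples=200, random_state=0)

gs = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": 10.0 ** np.arange(-2, 2)},  # 4 grid points
    cv=3,
    n_jobs=2,
    pre_dispatch="2*n_jobs",  # at most 2*n_jobs tasks are queued ahead of time
)
gs.fit(X, y)
print(gs.best_params_)
```

Lowering pre_dispatch trades some scheduling latency for a smaller peak memory footprint, since fewer task payloads are materialized at once.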
The other 11 threads, I believe, have nothing to do with GridSearchCV itself, since they appear at the same level. I think they are leftovers from other commands.
By the way, I don't observe such behavior on Mac (I only see the 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.
Here is my pstree
output (part of the path deleted for privacy):
└─┬= 00396 *** -fish
└─┬= 21743 *** python /Users/***/scratch_5.py
├─── 21775 *** python /Users/***/scratch_5.py
├─── 21776 *** python /Users/***/scratch_5.py
├─── 21777 *** python /Users/***/scratch_5.py
├─── 21778 *** python /Users/***/scratch_5.py
└─── 21779 *** python /Users/***/scratch_5.py
Answer to the second comment:
That's actually your code. I just generated a separable 1-D two-class problem:
import numpy as np

N = 50000
Xs = np.concatenate( (np.random.random(N) , 3+np.random.random(N)) ).reshape(-1, 1)
ys = np.concatenate( (np.zeros(N), np.ones(N)) )
100k samples were enough to keep the CPU busy for about a minute.