Here is the pstree output of my currently running GridSearch. I am curious to see what processes are going on, and there is something I cannot explain yet.
├─bash─┬─perl───20*[bash───python─┬─5*[python───31*[{python}]]]
│ │ └─11*[{python}]]
│ └─tee
└─bash───pstree
I removed unrelated processes. Curly braces denote threads.
I used parallel -j 20 to start my Python jobs, and as you can see, 20* indeed shows there are 20 processes. The bash process before each of the python processes is due to activating the Anaconda virtual environment with source activate venv.
Inside each python process, another 5 python processes (5*) are spawned. This is because I specified n_jobs=5 to GridSearchCV.
My understanding ends here.
Question: can anyone explain why there are another 11 python threads (11*[{python}]) alongside the grid search, and 31 python threads (31*[{python}]) spawned inside each of the 5 grid search jobs?
Update: added the code for calling GridSearchCV
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Cs = 10 ** np.arange(-2, 2, 0.1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression()
gs = GridSearchCV(
clf,
param_grid={'C': Cs, 'penalty': ['l1'],
'tol': [1e-10], 'solver': ['liblinear']},
cv=skf,
scoring='neg_log_loss',
n_jobs=5,
verbose=1,
refit=True)
gs.fit(Xs, ys)
Update (2017-09-27):
I wrapped up a test code on gist for you to easily reproduce if interested.
I tested the same code on a Mac Pro and multiple linux machines, and reproduced @igrinis' result, but only on the Mac Pro. On the linux machines I get different numbers than before, but consistently. So the number of threads spawned may depend on the particular data fed to GridSearchCV.
python─┬─5*[python───31*[{python}]]
└─3*[{python}]
Note that the pstree versions installed by homebrew/linuxbrew on the Mac Pro and the linux machines are different. Here are the exact versions I used:
Mac:
pstree $Revision: 2.39 $ by Fred Hucht (C) 1993-2015
EMail: fred AT thp.uni-due.de
Linux:
pstree (PSmisc) 22.20
Copyright (C) 1993-2009 Werner Almesberger and Craig Small
The Mac version doesn't seem to have an option to show threads, which could be why they do not appear in the output. I haven't found an easy way to inspect threads on the Mac Pro yet. If you happen to know a way, please comment.
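For reference, on Linux you can count a process's native threads without pstree by reading /proc/&lt;pid&gt;/status; macOS has no /proc, but ps -M -p &lt;pid&gt; prints one row per thread. A minimal Linux-only sketch:

```python
import os

def thread_count(pid: int) -> int:
    """Return the native thread count of a process on Linux via /proc."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError(f"no Threads: line for pid {pid}")

# A freshly started interpreter typically reports a small number here.
print(thread_count(os.getpid()))
```

This counts all OS-level threads, including OpenMP pool threads that never show up as Python threading objects.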
Update (2017-10-12)
In another set of experiment, I confirmed that setting the environment variable OMP_NUM_THREADS
makes a difference.
Before export OMP_NUM_THREADS=1, there are many (63 in this case) threads of unclear use spawned, as described above:
bash───python─┬─23*[python───63*[{python}]]
└─3*[{python}]
GNU parallel is not used here; n_jobs=23.
After export OMP_NUM_THREADS=1, no extra threads are spawned, but the 3 Python processes are still there, and I am still unaware of their purpose.
bash───python─┬─23*[python]
└─3*[{python}]
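The variable can also be set from inside Python instead of the shell, as long as it happens before numpy/scipy/sklearn are imported, since the OpenMP runtime (MKL/OpenBLAS) reads it when its thread pool is created. A minimal sketch:

```python
import os

# Must be set before numpy/scipy/sklearn are imported: the OpenMP runtime
# reads this variable when it initializes its thread pool.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # imported only after the cap is in place
```

This is convenient when you cannot control the environment the script is launched from.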
I initially came across OMP_NUM_THREADS because it causes errors in some of my GridSearchCV jobs; the error messages look like this:
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
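System error #11 is EAGAIN, which on thread creation usually means a resource limit was hit. One thing worth checking (a guess, not a confirmed diagnosis of this exact failure) is the per-user process limit, which on Linux counts threads as well:

```shell
# RLIMIT_NPROC counts threads as well as processes; if the grid search
# workers plus their OpenMP pools exceed it, pthread_create fails with EAGAIN.
ulimit -u
```

If the printed number is small relative to n_jobs times the OpenMP pool size, lowering OMP_NUM_THREADS (as the hint suggests) or raising the limit should help.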
From sklearn.GridSearchCV
doc:
n_jobs : int, default=1 Number of jobs to run in parallel.
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
An int, giving the exact number of total jobs that are spawned
A string, giving an expression as a function of n_jobs, as in '2*n_jobs'
If I understand the documentation properly, GridSearchCV spawns as many jobs as there are grid points, but only runs n_jobs of them simultaneously. The number 31, I believe, is some kind of cap related to your 40 possible values. Try playing with the value of the pre_dispatch parameter.
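To illustrate the suggestion, here is a small self-contained sketch of passing pre_dispatch explicitly (the dataset and parameter values are made up for the example; '2*n_jobs' happens to be sklearn's default):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny synthetic problem, just to exercise the parameter.
X, y = make_classification(n_samples=200, random_state=0)

gs = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": 10.0 ** np.arange(-2, 2)},  # 4 grid points
    cv=3,
    n_jobs=2,
    pre_dispatch="2*n_jobs",  # at most 2*n_jobs tasks are queued ahead of time
)
gs.fit(X, y)
print(gs.best_params_)
```

Lowering pre_dispatch trades some scheduling latency for a smaller peak memory footprint, since fewer task payloads are materialized at once.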
The other 11 threads, I believe, have nothing to do with GridSearchCV itself, since they appear at the same level. I think they are leftovers from other commands.
By the way, I don't observe such behavior on Mac (I only see the 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.
Here is my pstree
output (part of the path deleted for privacy):
└─┬= 00396 *** -fish
└─┬= 21743 *** python /Users/***/scratch_5.py
├─── 21775 *** python /Users/***/scratch_5.py
├─── 21776 *** python /Users/***/scratch_5.py
├─── 21777 *** python /Users/***/scratch_5.py
├─── 21778 *** python /Users/***/scratch_5.py
└─── 21779 *** python /Users/***/scratch_5.py
Answer to the second comment:
That's actually your code. I just generated a separable 1-D two-class problem:
import numpy as np

N = 50000
Xs = np.concatenate( (np.random.random(N) , 3+np.random.random(N)) ).reshape(-1, 1)
ys = np.concatenate( (np.zeros(N), np.ones(N)) )
100k samples were enough to keep the CPU busy for about a minute.