Many scikit-learn functions have user-friendly parallelization built in. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM for resource management)? As far as I know, sklearn uses joblib for parallelization, which in turn uses multiprocessing. And, as far as I know (from, for example, Python multiprocessing within MPI), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions over several computational nodes just by using mpirun and the n_jobs argument?
For reference, n_jobs is an integer specifying the maximum number of concurrently running workers: if 1 is given, no joblib parallelism is used at all (which is useful for debugging); if set to -1, all CPUs are used.
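On a single multi-core machine this is simply a matter of passing n_jobs; a minimal sketch (the estimator and dataset below are arbitrary examples, and cross_val_score now lives in sklearn.model_selection rather than sklearn.cross_validation):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
model = RandomForestClassifier(n_estimators=100)

# n_jobs=-1 runs one worker per CPU core, but only on this single machine
scores = cross_val_score(model, X, y, cv=10, n_jobs=-1)
print(scores.mean())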
SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems like dask.distributed or IPython Parallel. See this issue on the scikit-learn GitHub page for details.
Code taken from the issue page linked above.
from sklearn.externals.joblib import parallel_backend

search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)

with parallel_backend('dask', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)
This requires that you set up a dask.distributed scheduler and workers on your cluster. General instructions are available here: http://dask.readthedocs.io/en/latest/setup.html
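With a recent joblib (imported directly rather than via sklearn.externals) and dask.distributed, the scheduler address can also be supplied by creating a Client first and then selecting the 'dask' backend; a minimal sketch, assuming a scheduler is already running and that the host/port below are placeholders, with search and digits as defined above:

from dask.distributed import Client
from joblib import parallel_backend

# connect to the already-running scheduler; this makes the 'dask' backend available
client = Client('your_scheduler_host:8786')

with parallel_backend('dask'):
    search.fit(digits.data, digits.target)   # fits are dispatched to the dask workers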
ipyparallel
Code taken from the same issue page.
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
from sklearn.datasets import load_digits  # import added; the original snippet assumed it

digits = load_digits()

c = Client(profile='myprofile')
print(c.ids)

bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel',
                          lambda: IPythonParallelBackend(view=bview))

...

with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)
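Note that the ipyparallel example assumes a controller and engines are already running under the profile passed to Client; they can be started with ipyparallel's ipcluster command-line tool, e.g. ipcluster start --profile=myprofile -n <number of engines>.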
Note: in both of the above examples, the n_jobs parameter no longer seems to matter.
For SLURM, the easiest way to do this is probably to use the dask-jobqueue project:
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='...', queue='...', ...)
>>> cluster.scale(20)
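Once the cluster has scaled up, you attach a dask.distributed Client to it and use the joblib backend as above; a minimal sketch (search and digits are the placeholder objects from the earlier snippets):

>>> from dask.distributed import Client
>>> from joblib import parallel_backend
>>> client = Client(cluster)          # attach to the SLURM-backed Dask cluster
>>> with parallel_backend('dask'):
...     search.fit(digits.data, digits.target)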
You could also use dask-mpi or any of several other methods mentioned in Dask's setup documentation.
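Since the original question mentions OpenMPI and mpirun specifically: dask-mpi bootstraps the Dask scheduler and workers from inside an MPI job. A rough sketch, assuming dask-mpi is installed and the script is launched with something like mpirun -np <N> python your_script.py (search and digits remain placeholders):

from dask_mpi import initialize
from dask.distributed import Client
from joblib import parallel_backend

initialize()        # rank 0 runs the scheduler; most remaining ranks become workers
client = Client()   # connects this process to the scheduler started by initialize()

with parallel_backend('dask'):
    search.fit(digits.data, digits.target)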
Alternatively you can set up a dask.distributed or IPyParallel cluster and then use these interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
You could also try the Dask-ML package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but computationally implemented on top of Dask:
https://github.com/dask/dask-ml
pip install dask-ml
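A hedged sketch of that drop-in replacement (the parameter names mirror scikit-learn's, and model, param_space and digits are the placeholders used above); it assumes a Dask Client is already connected to your cluster:

from dask_ml.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(model, param_space, n_iter=1000, cv=10)
search.fit(digits.data, digits.target)   # the search runs on the active Dask cluster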