Many scikit-learn functions have user-friendly parallelization built in. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM for resource management)? As far as I know, sklearn uses joblib for parallelization, which in turn uses multiprocessing. And, as far as I know (from, for example, Python multiprocessing within MPI), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions over several computational nodes just by using mpirun and the n_jobs argument?
For reference, n_jobs is an integer specifying the maximum number of concurrently running workers: if 1 is given, no joblib parallelism is used at all (which is useful for debugging); if set to -1, all CPUs are used.
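On a single multi-core machine this is simply a matter of passing n_jobs; a minimal sketch (the estimator and dataset below are arbitrary examples, and cross_val_score now lives in sklearn.model_selection rather than sklearn.cross_validation):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
model = RandomForestClassifier(n_estimators=100)

# n_jobs=-1 runs one worker per CPU core, but only on this single machine
scores = cross_val_score(model, X, y, cv=10, n_jobs=-1)
print(scores.mean())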
SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems like dask.distributed or IPython Parallel. See this issue on the scikit-learn GitHub page for details.
Code taken from the issue page linked above.
from sklearn.externals.joblib import parallel_backend

search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)

with parallel_backend('dask', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)
This requires that you set up a dask.distributed scheduler and workers on your cluster. General instructions are available here: http://dask.readthedocs.io/en/latest/setup.html
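With a recent joblib (imported directly rather than via sklearn.externals) and dask.distributed, the scheduler address can also be supplied by creating a Client first and then selecting the 'dask' backend; a minimal sketch, assuming a scheduler is already running and that the host/port below are placeholders, with search and digits as defined above:

from dask.distributed import Client
from joblib import parallel_backend

# connect to the already-running scheduler; this makes the 'dask' backend available
client = Client('your_scheduler_host:8786')

with parallel_backend('dask'):
    search.fit(digits.data, digits.target)   # fits are dispatched to the dask workers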
ipyparallel
Code taken from the same issue page.
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
from sklearn.datasets import load_digits  # import added; the original snippet assumed it

digits = load_digits()

c = Client(profile='myprofile')
print(c.ids)

bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel',
                          lambda: IPythonParallelBackend(view=bview))

...

with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)
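Note that the ipyparallel example assumes a controller and engines are already running under the profile passed to Client; they can be started with ipyparallel's ipcluster command-line tool, e.g. ipcluster start --profile=myprofile -n <number of engines>.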
Note: in both of the above examples, the n_jobs parameter no longer seems to matter.
For SLURM, the easiest way to do this is probably to use the dask-jobqueue project:
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='...', queue='...', ...)
>>> cluster.scale(20)
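Once the cluster has scaled up, you attach a dask.distributed Client to it and use the joblib backend as above; a minimal sketch (search and digits are the placeholder objects from the earlier snippets):

>>> from dask.distributed import Client
>>> from joblib import parallel_backend
>>> client = Client(cluster)          # attach to the SLURM-backed Dask cluster
>>> with parallel_backend('dask'):
...     search.fit(digits.data, digits.target)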
You could also use dask-mpi or any of several other methods mentioned in Dask's setup documentation.
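Since the original question mentions OpenMPI and mpirun specifically: dask-mpi bootstraps the Dask scheduler and workers from inside an MPI job. A rough sketch, assuming dask-mpi is installed and the script is launched with something like mpirun -np <N> python your_script.py (search and digits remain placeholders):

from dask_mpi import initialize
from dask.distributed import Client
from joblib import parallel_backend

initialize()        # rank 0 runs the scheduler; most remaining ranks become workers
client = Client()   # connects this process to the scheduler started by initialize()

with parallel_backend('dask'):
    search.fit(digits.data, digits.target)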
Alternatively you can set up a dask.distributed or IPyParallel cluster and then use these interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
You could also try the Dask-ML package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but computationally implemented on top of Dask:
https://github.com/dask/dask-ml
pip install dask-ml
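A hedged sketch of that drop-in replacement (the parameter names mirror scikit-learn's, and model, param_space and digits are the placeholders used above); it assumes a Dask Client is already connected to your cluster:

from dask_ml.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(model, param_space, n_iter=1000, cv=10)
search.fit(digits.data, digits.target)   # the search runs on the active Dask cluster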