I am trying to use Dask to do some embarrassingly parallel processing. For some reason I have to use Dask, although the task could easily be achieved with multiprocessing.Pool(5).map.
For example:
import dask
from dask import compute, delayed

def do_something(x):
    return x * x

data = range(10)
delayed_values = [delayed(do_something)(x) for x in data]
results = compute(*delayed_values, scheduler='processes')
It works, but apparently it uses only one process.
How can I configure dask so it uses a pool of 5 processes for this computation?
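For reference, the plain multiprocessing version I would otherwise use looks like this (just a sketch of the equivalent):

from multiprocessing import Pool

def do_something(x):
    return x * x

if __name__ == "__main__":
    with Pool(5) as pool:
        results = pool.map(do_something, range(10))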
You can use the num_workers parameter to specify the number of processes in the compute call:
results = compute(*delayed_values, scheduler='processes', num_workers=5)
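For example, here is a minimal sketch that also records each worker's PID, so you can confirm that several processes are actually used (the PID check is just for illustration):

import os
from dask import compute, delayed

def do_something(x):
    # Return the square plus the PID of the process that computed it
    return x * x, os.getpid()

if __name__ == "__main__":
    delayed_values = [delayed(do_something)(x) for x in range(10)]
    results = compute(*delayed_values, scheduler='processes', num_workers=5)
    print({pid for _, pid in results})  # should show up to 5 distinct PIDs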
Alternatively, you can configure Dask to use a custom process pool:
import dask
from multiprocessing.pool import Pool

# Hand Dask an existing 5-worker pool; the 'processes' scheduler reuses it
dask.config.set(pool=Pool(5))
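A short sketch of how this fits together with the compute call from the question (do_something is the function defined there; this is just one way to wire it up):

import dask
from dask import compute, delayed
from multiprocessing.pool import Pool

def do_something(x):
    return x * x

if __name__ == "__main__":
    # The 'processes' scheduler picks the pool up from the config
    dask.config.set(pool=Pool(5))
    delayed_values = [delayed(do_something)(x) for x in range(10)]
    results = compute(*delayed_values, scheduler='processes')
    print(results)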
Or as a context manager:
with dask.config.set(scheduler='processes', num_workers=5):
...
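Put together, a complete sketch using the context-manager form (the settings only apply inside the with block):

import dask
from dask import compute, delayed

def do_something(x):
    return x * x

if __name__ == "__main__":
    delayed_values = [delayed(do_something)(x) for x in range(10)]
    with dask.config.set(scheduler='processes', num_workers=5):
        # compute picks up scheduler and num_workers from the active config
        results = compute(*delayed_values)
    print(results)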
For more background, you may want to read the Dask scheduling documentation (dask_scheduling) or my previous answer.