
parallel dask for loop slower than regular loop?

If I try to parallelize a for loop with dask, it runs slower than the regular version. I am basically following the introductory example from the dask tutorial, but on my machine the dask version takes longer than the plain loop. What am I doing wrong?

In [1]: import numpy as np
   ...: from dask import delayed, compute
   ...: import dask.multiprocessing

In [2]: a10e4 = np.random.rand(10000, 11).astype(np.float16)
   ...: b10e4 = np.random.rand(10000, 11).astype(np.float16)

In [3]: def subtract(a, b):
   ...:     return a - b

In [4]: %%timeit
   ...: results = [subtract(a10e4, b10e4[index]) for index in range(len(b10e4))]
1 loop, best of 3: 10.6 s per loop

In [5]: %%timeit
   ...: values = [delayed(subtract)(a10e4, b10e4[index]) for index in range(len(b10e4)) ]
   ...: resultsDask = compute(*values, get=dask.multiprocessing.get)
1 loop, best of 3: 14.4 s per loop
mistakeNot asked Feb 12 '18 15:02





1 Answer

Two issues:

  1. Dask introduces about a millisecond of overhead per task. You'll want to ensure that each task's computation takes significantly longer than that.
  2. When using the multiprocessing scheduler, data gets serialized between processes, which can be quite expensive. See http://dask.pydata.org/en/latest/setup.html
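
To put rough numbers on both points, here is a back-of-the-envelope sketch using only NumPy and the standard library. It reuses the array shapes and the 10.6 s timing from the question and the "about a millisecond" figure from above. It assumes that every task's arguments and result are pickled in full when they cross a process boundary; dask's actual serialization path may differ, so treat the byte counts as an order-of-magnitude estimate rather than a measurement of dask itself.

```python
import pickle

import numpy as np

# Same shapes and dtype as the arrays in the question.
a = np.random.rand(10000, 11).astype(np.float16)
b = np.random.rand(10000, 11).astype(np.float16)

n_tasks = len(b)  # the question creates one delayed task per row of b

# Issue 1: compare the useful work per task with the scheduling overhead.
loop_seconds = 10.6       # plain-loop timing reported in the question
overhead_per_task = 1e-3  # "about a millisecond" per task (figure from the answer)
work_per_task = loop_seconds / n_tasks
total_overhead = n_tasks * overhead_per_task
print(f"work per task:            {work_per_task * 1e3:.2f} ms")
print(f"estimated total overhead: {total_overhead:.1f} s")

# Issue 2: bytes that would be pickled for a single task, ASSUMING the
# arguments travel to a worker process and the result travels back.
args_bytes = len(pickle.dumps((a, b[0]), protocol=pickle.HIGHEST_PROTOCOL))
result_bytes = len(pickle.dumps(a - b[0], protocol=pickle.HIGHEST_PROTOCOL))
total_gb = n_tasks * (args_bytes + result_bytes) / 1e9
print(f"pickled per task:         {(args_bytes + result_bytes) / 1e3:.0f} kB")
print(f"pickled over all tasks:   {total_gb:.1f} GB")
```

Under these assumptions each task does roughly a millisecond of real work, which is about the same as the per-task overhead, and every task moves the full `a` array in and a same-sized result out. Grouping many rows of `b` into one task, or using a scheduler that does not copy data between processes, reduces both costs.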
MRocklin answered Sep 17 '22 03:09
