I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fits my situation, which is extremely basic.
First, is this the correct way to run a for-loop in parallel?
%%time
from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1

for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)

total = delayed(sum)(keep_return)
total.compute()
This produced
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s
If I run this in serial,
%%time
list_names=['a','b','c','d']
keep_return=[]
def loop_dummy(target):
for i in range (1000000000):
pass
print('passed value is:'+target)
return(1)
for i in list_names:
c=loop_dummy(i)
keep_return.append(c)
it is actually faster.
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s
I have seen it stated that Dask adds a small amount of overhead, but this workload seems to run long enough to justify that overhead, no?
My actual for loop involves heavier computation where I build a model for various targets.
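For context, the real loop looks roughly like this. This is only a sketch: fit_model and its body are placeholders for my actual model-building code.

    import dask
    from dask import delayed

    @delayed
    def fit_model(target):
        # Placeholder for the real per-target training routine:
        # any CPU-heavy computation goes here.
        return {'target': target, 'score': sum(range(10_000_000))}

    models = [fit_model(t) for t in ['a', 'b', 'c', 'd']]
    results = dask.compute(*models)  # tuple with one result per target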
In your example, Dask is slower than the plain serial loop because you don't specify a scheduler, so Dask uses the threaded backend, which is the default. As mdurant has pointed out, your code does not release the GIL, so the threaded scheduler cannot execute the task graph in parallel.
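If you want to change the backend, you can also set the scheduler globally instead of per compute() call. A minimal sketch, assuming a Dask version new enough to have dask.config.set (0.18+):

    import dask
    from dask import delayed

    # Route every subsequent compute() to the process-based scheduler,
    # so CPU-bound pure-Python tasks are not serialized by the GIL.
    dask.config.set(scheduler='processes')

    @delayed
    def busy(x):
        for _ in range(10_000_000):  # pure-Python loop: holds the GIL
            pass
        return 1

    total = delayed(sum)([busy(i) for i in range(4)])
    print(total.compute())  # runs on the multiprocessing scheduler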
This computation

    for i in range(...):
        pass

is bound by the Global Interpreter Lock (GIL). You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:
    total.compute(scheduler='multiprocessing')
However, if your actual computation is mostly Numpy/Pandas/Scikit-Learn/Other numeric package code, then the default threading backend is probably the right choice.
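For example, a workload like the following spends its time inside NumPy, which releases the GIL during the heavy array operations, so the default threaded scheduler can run the tasks in parallel. A minimal sketch with a made-up numeric workload:

    import numpy as np
    from dask import delayed

    @delayed
    def numeric_work(seed):
        # BLAS-backed operations like the matrix product below release
        # the GIL, so several tasks can run at once on threads.
        rng = np.random.RandomState(seed)
        x = rng.random_sample((1000, 1000))
        return float(np.linalg.norm(x @ x))

    total = delayed(sum)([numeric_work(s) for s in range(4)])
    print(total.compute())  # default threaded scheduler is fine here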
More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html