
Dask For Loop In Parallel

I am trying to find the correct syntax for using a for loop with dask.delayed. I have found several tutorials and other questions, but none fit my case, which is extremely basic.

First, is this the correct way to run a for-loop in parallel?

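%%time
from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    # Busy loop standing in for real work
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1


for name in list_names:
    c = loop_dummy(name)  # builds a lazy task; nothing runs yet
    keep_return.append(c)


# Sum the lazy results in one graph, then execute it
total = delayed(sum)(keep_return)
total.compute()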

This produced

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s

If I run this in serial,

%%time

list_names = ['a', 'b', 'c', 'd']
keep_return = []


def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1


for name in list_names:
    c = loop_dummy(name)
    keep_return.append(c)

it is actually faster.

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s

I have seen examples stating that Dask adds a small amount of overhead, but this computation seems long enough to justify that overhead, no?

My actual for loop involves heavier computation where I build a model for various targets.

B_Miner asked Jun 29 '18 23:06



1 Answer

This computation

for i in range(...):
    pass

is bound by the global interpreter lock (GIL), so the default threaded scheduler cannot run it on several cores at once. You will want to use the multiprocessing or dask.distributed schedulers rather than the default threading scheduler. I recommend the following:

total.compute(scheduler='multiprocessing')

However, if your actual computation is mostly NumPy, Pandas, scikit-learn, or other numeric-package code, which typically releases the GIL, then the default threading scheduler is probably the right choice.
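As a rough sketch of the recommendation above, here is the question's loop written as a standalone script with the scheduler passed explicitly. The names `count_targets` and `n_iter` are made up for illustration, and the iteration count is reduced so it finishes quickly. I am assuming `"processes"` as the scheduler name; it should select the same process-based scheduler as `'multiprocessing'` above.

```python
from dask import delayed


@delayed
def loop_dummy(target, n_iter=10_000_000):
    # Pure-Python busy loop: it holds the GIL, so threads cannot run it in parallel.
    for _ in range(n_iter):
        pass
    return 1


def count_targets(names, scheduler="processes", n_iter=10_000_000):
    # Hypothetical helper: build one lazy task per name, then sum them in one graph.
    tasks = [loop_dummy(name, n_iter) for name in names]
    return delayed(sum)(tasks).compute(scheduler=scheduler)


if __name__ == "__main__":
    # The guard matters: process-based schedulers may re-import this module in workers.
    print(count_targets(["a", "b", "c", "d"]))
```

Passing `scheduler="threads"` to the same helper gives the default behaviour from the question, which makes it easy to time both on your real workload.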

More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html

MRocklin answered Sep 21 '22 10:09