I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fits my situation, which is extremely basic.
First, is this the correct way to run a for-loop in parallel?
%%time
from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1

for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)

total = delayed(sum)(keep_return)
total.compute()
This produced
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s
If I run this in serial,
%%time
list_names=['a','b','c','d']
keep_return=[]
def loop_dummy(target):
for i in range (1000000000):
pass
print('passed value is:'+target)
return(1)
for i in list_names:
c=loop_dummy(i)
keep_return.append(c)
it is actually faster.
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s
I have seen it stated that Dask adds a small amount of overhead, but this workload seems to run long enough to justify that overhead, no?
My actual for loop involves heavier computation where I build a model for various targets.
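For context, the real loop looks roughly like this. This is only a sketch: fit_model and its body are placeholders for my actual model-building code.

    import dask
    from dask import delayed

    @delayed
    def fit_model(target):
        # Placeholder for the real per-target training routine:
        # any CPU-heavy computation goes here.
        return {'target': target, 'score': sum(range(10_000_000))}

    models = [fit_model(t) for t in ['a', 'b', 'c', 'd']]
    results = dask.compute(*models)  # tuple with one result per target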
In your example, Dask is slower than the plain serial loop because you don't specify a scheduler, so Dask uses the threaded backend, which is the default. As mdurant has pointed out, your code does not release the GIL, so the threaded scheduler cannot execute the task graph in parallel.
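If you want to change the backend, you can also set the scheduler globally instead of per compute() call. A minimal sketch, assuming a Dask version new enough to have dask.config.set (0.18+):

    import dask
    from dask import delayed

    # Route every subsequent compute() to the process-based scheduler,
    # so CPU-bound pure-Python tasks are not serialized by the GIL.
    dask.config.set(scheduler='processes')

    @delayed
    def busy(x):
        for _ in range(10_000_000):  # pure-Python loop: holds the GIL
            pass
        return 1

    total = delayed(sum)([busy(i) for i in range(4)])
    print(total.compute())  # runs on the multiprocessing scheduler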
This computation

    for i in range(...):
        pass

is bound by the Global Interpreter Lock (GIL). You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:
    total.compute(scheduler='multiprocessing')
However, if your actual computation is mostly Numpy/Pandas/Scikit-Learn/Other numeric package code, then the default threading backend is probably the right choice.
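For example, a workload like the following spends its time inside NumPy, which releases the GIL during the heavy array operations, so the default threaded scheduler can run the tasks in parallel. A minimal sketch with a made-up numeric workload:

    import numpy as np
    from dask import delayed

    @delayed
    def numeric_work(seed):
        # BLAS-backed operations like the matrix product below release
        # the GIL, so several tasks can run at once on threads.
        rng = np.random.RandomState(seed)
        x = rng.random_sample((1000, 1000))
        return float(np.linalg.norm(x @ x))

    total = delayed(sum)([numeric_work(s) for s in range(4)])
    print(total.compute())  # default threaded scheduler is fine here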
More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html