I want to submit functions with Dask that have large (gigabyte scale) arguments. What is the best way to do this? I want to run this function many times with different (small) parameters.
This uses the concurrent.futures interface. We could use the dask.delayed interface just as easily.
import numpy as np
from dask.distributed import Client

x = np.random.random(size=100000000)  # 800 MB array
params = list(range(100))             # 100 small parameters

def f(x, param):
    pass

c = Client()
futures = [c.submit(f, x, param) for param in params]
But this is slower than I would expect or results in memory errors.
OK, so what's wrong here is that each task contains the numpy array x, which is large. For each of the 100 tasks that we submit we need to serialize x, send it up to the scheduler, send it over to the worker, and so on.
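To get a rough feel for that cost, you can check the serialized size of x yourself. This is only an illustration (Dask uses its own serialization for numpy arrays, but the payload is about the same size):

import pickle
# Roughly the number of bytes shipped per submitted task in the naive version
nbytes = len(pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL))
print(f"~{nbytes / 1e6:.0f} MB serialized per task, times 100 tasks")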
Instead, we'll send the array up to the cluster once:
[future] = c.scatter([x])
Now future is a token that points to an array x that lives on the cluster. We can submit tasks that refer to this remote future, instead of the numpy array on our local client.
# futures = [c.submit(f, x, param) for param in params] # sends x each time
futures = [c.submit(f, future, param) for param in params] # refers to remote x already on cluster
This is now much faster, and lets Dask control data movement more effectively.
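When the tasks finish, the (small) results can be pulled back to the client with Client.gather; for example, assuming f returns something small:

results = c.gather(futures)  # collects the small results; x itself stays on the cluster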
If you expect to need to move the array x to all workers eventually, then you may want to broadcast the array to start:
[future] = c.scatter([x], broadcast=True)
Futures work fine with dask.delayed as well. There is no performance benefit here, but some people prefer this interface:
# futures = [c.submit(f, future, param) for param in params]
from dask import delayed
lazy_values = [delayed(f)(future, param) for param in params]
futures = c.compute(lazy_values)
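Here c.compute returns a list of futures, so you can wait on them and collect results the same way as before; a minimal sketch:

from dask.distributed import wait
wait(futures)                 # block until all tasks have finished
results = c.gather(futures)   # then fetch the small results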