Dask prints warning to use client.scatter although I'm using the suggested approach

Tags: python, python-3.x, dask, dask-distributed

In dask distributed I get the following warning, which I would not expect:

/home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph: 
  (['int-58e78e1b34eb49a68c65b54815d1b158', 'int-5cd ... 161071d7ae7'],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s))

The reason I'm surprised is that I'm doing exactly what the warning suggests:

import dask.dataframe as dd
import pandas
from dask.distributed import Client, LocalCluster

c = Client(LocalCluster())
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
filter_list = c.scatter(list(range(2,100000,2)))
mask = c.submit(dask_df['A'].isin, filter_list)
dask_df[mask.result()].compute()

So my question is: Am I doing something wrong or is this a bug?

pandas='0.22.0'
dask='0.17.0'

asked Feb 22 '18 14:02

dennis-w

1 Answer

The main reason dask is complaining isn't the list; it's the pandas dataframe inside the dask dataframe.

dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)

You are creating a biggish amount of data locally when you create a pandas dataframe in your local session. Then you operate with it on the cluster. This will require moving your pandas dataframe to the cluster.

You're welcome to ignore these warnings, but in general I would not be surprised if performance here is worse than with pandas alone.
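
For a frame this small, the same filter can be written in pandas alone. A minimal sketch, reusing the data from the question; the variable names are only illustrative:

```python
import pandas as pd

# Same data as in the question: 5000 rows cycling through 1..5.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5] * 1000})

# Same filter values as in the question (even numbers from 2 to 99998).
filter_values = list(range(2, 100000, 2))

# Boolean mask and selection happen in local memory, so nothing
# has to be shipped to a scheduler or to workers.
result = df[df['A'].isin(filter_values)]
```

Only the rows where A is 2 or 4 survive the filter.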

There are a few other things going on here. Your scatter of a list produces a bunch of futures, which may not be what you want. You're calling submit on a dask object, which is usually unnecessary.

answered Oct 25 '22 01:10

MRocklin
