I am confused about the difference between client.persist() and client.compute(). Both seem (in some cases) to start my calculations, and both return asynchronous objects, but not in my simple example:
from dask.distributed import Client
from dask import delayed
client = Client()
def f(*args):
    return args
result = [delayed(f)(x) for x in range(1000)]
x1 = client.compute(result)
x2 = client.persist(result)
Here x1 and x2 are different, but in a less trivial calculation where result is also a list of Delayed objects, using client.persist(result) starts the calculation just like client.compute(result) does.
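For what it's worth, inspecting the types shows how they differ (a sketch of what I get back):

type(x1[0])  # distributed.client.Future
type(x2[0])  # dask.delayed.Delayed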
Dask DataFrames are composed of a collection of underlying pandas DataFrames (partitions). compute() concatenates all the Dask DataFrame partitions into a single pandas DataFrame.
The reason a Dask DataFrame takes more time to compute (shape or any other operation) is that when compute() is called, Dask performs every operation from the creation of the current DataFrame (or its ancestors) up to the point where compute() is called.
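A minimal sketch of both points, using a made-up toy DataFrame (the column and partition count are just illustrative):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)
df = df[df.x > 2]     # lazy; nothing has run yet

df.compute()          # runs the filter, then concatenates the partitions into one pandas DataFrame
df.x.sum().compute()  # runs the filter again from scratch

df = df.persist()     # run the filter once, keep the partitions in memory
df.x.sum().compute()  # now only the sum is computed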
The Client connects users to a Dask cluster. It provides an asynchronous user interface around functions and futures. This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls.
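For example (inc is a hypothetical function, just to show the interface):

from dask.distributed import Client

client = Client()  # connect to (or start) a cluster

def inc(x):
    return x + 1

future = client.submit(inc, 1)       # a single Future
futures = client.map(inc, range(4))  # a list of Futures
total = client.submit(sum, futures)  # Futures can be passed back into submit
total.result()                       # 10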
Relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures
As you say, both Client.compute and Client.persist take lazy Dask collections and start them running on the cluster. They differ in what they return.
Client.persist returns a copy of each Dask collection with its previously lazy computations now submitted to run on the cluster. The task graphs of these collections now just point to the currently running Future objects.
So if you persist a dask dataframe with 100 partitions you get back a dask dataframe with 100 partitions, with each partition pointing to a future currently running on the cluster.
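Sketched out (the toy data is made up; futures_of is the distributed helper for inspecting the futures behind a collection):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, futures_of

client = Client()
df = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=100)

df = client.persist(df)  # still a dask dataframe with 100 partitions
len(futures_of(df))      # 100, one Future per partition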
Client.compute returns a single Future for each collection. This future refers to a single Python object result collected on one worker. This is typically used for small results.
So if you compute a Dask DataFrame with 100 partitions you get back a Future pointing to a single pandas DataFrame that holds all of the data.
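Continuing the same made-up 100-partition example:

future = client.compute(df)  # a single Future for the whole collection
result = future.result()     # blocks, then returns one pandas DataFrame locally
len(result)                  # 1000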
More pragmatically, I recommend using persist when your result is large and needs to be spread among many computers and using compute when your result is small and you want it on just one computer.
In practice I rarely use Client.compute, preferring instead to use persist for intermediate staging and dask.compute to pull down final results.
import dask.dataframe as dd

df = dd.read_csv('...')
df = df[df.name == 'alice']
df = df.persist() # compute up to here, keep results in memory
>>> df.value.max().compute()
100
>>> df.value.min().compute()
0
Delayed objects only have one "partition" regardless, so compute and persist are more interchangeable. Persist will give you back a lazy dask.delayed object, while compute will give you back an immediate Future object.
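A quick sketch of that difference, assuming a running client as above:

from dask import delayed

d = delayed(sum)([1, 2, 3])

p = client.persist(d)  # still a lazy Delayed whose graph points at a running Future
f = client.compute(d)  # an immediate Future

p.compute()  # 6
f.result()   # 6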