I am confused about the difference between client.persist() and client.compute(). Both seem (in some cases) to start my calculations, and both return asynchronous objects, but not in my simple example:
from dask.distributed import Client
from dask import delayed
client = Client()
def f(*args):
    return args
result = [delayed(f)(x) for x in range(1000)]
x1 = client.compute(result)
x2 = client.persist(result)
Here x1 and x2 are different, but in a less trivial calculation where result is also a list of Delayed objects, using client.persist(result) starts the calculation just like client.compute(result) does.
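For what it's worth, inspecting the types shows how they differ (a sketch of what I get back):

type(x1[0])  # distributed.client.Future
type(x2[0])  # dask.delayed.Delayed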
Dask DataFrames are composed of a collection of underlying pandas DataFrames (partitions). compute() concatenates all the Dask DataFrame partitions into a single pandas DataFrame.
The reason a Dask DataFrame takes more time to compute (shape or any other operation) is that when compute() is called, Dask performs every operation from the creation of the current DataFrame (or its ancestors) up to the point where compute() is called.
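A minimal sketch of both points, using a made-up toy DataFrame (the column and partition count are just illustrative):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)
df = df[df.x > 2]     # lazy; nothing has run yet

df.compute()          # runs the filter, then concatenates the partitions into one pandas DataFrame
df.x.sum().compute()  # runs the filter again from scratch

df = df.persist()     # run the filter once, keep the partitions in memory
df.x.sum().compute()  # now only the sum is computed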
The Client connects users to a Dask cluster. It provides an asynchronous user interface around functions and futures. This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls.
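For example (inc is a hypothetical function, just to show the interface):

from dask.distributed import Client

client = Client()  # connect to (or start) a cluster

def inc(x):
    return x + 1

future = client.submit(inc, 1)       # a single Future
futures = client.map(inc, range(4))  # a list of Futures
total = client.submit(sum, futures)  # Futures can be passed back into submit
total.result()                       # 10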
Relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures
As you say, both Client.compute and Client.persist take lazy Dask collections and start them running on the cluster. They differ in what they return.
Client.persist returns a copy of each Dask collection with its previously lazy computations now submitted to run on the cluster. The task graphs of these collections now just point to the currently running Future objects.
So if you persist a dask dataframe with 100 partitions you get back a dask dataframe with 100 partitions, with each partition pointing to a future currently running on the cluster.
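Sketched out (the toy data is made up; futures_of is the distributed helper for inspecting the futures behind a collection):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, futures_of

client = Client()
df = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=100)

df = client.persist(df)  # still a dask dataframe with 100 partitions
len(futures_of(df))      # 100, one Future per partition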
Client.compute returns a single Future for each collection. This future refers to a single Python object result collected on one worker. This is typically used for small results.
So if you compute a Dask DataFrame with 100 partitions you get back a Future pointing to a single pandas DataFrame that holds all of the data.
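Continuing the same made-up 100-partition example:

future = client.compute(df)  # a single Future for the whole collection
result = future.result()     # blocks, then returns one pandas DataFrame locally
len(result)                  # 1000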
More pragmatically, I recommend using persist when your result is large and needs to be spread among many computers and using compute when your result is small and you want it on just one computer.
In practice I rarely use Client.compute, preferring instead to use persist for intermediate staging and dask.compute to pull down final results.
import dask.dataframe as dd

df = dd.read_csv('...')
df = df[df.name == 'alice']
df = df.persist() # compute up to here, keep results in memory
>>> df.value.max().compute()
100
>>> df.value.min().compute()
0
Delayed objects only have one "partition" regardless, so compute and persist are more interchangeable. Persist will give you back a lazy dask.delayed object, while compute will give you back an immediate Future object.
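A quick sketch of that difference, assuming a running client as above:

from dask import delayed

d = delayed(sum)([1, 2, 3])

p = client.persist(d)  # still a lazy Delayed whose graph points at a running Future
f = client.compute(d)  # an immediate Future

p.compute()  # 6
f.result()   # 6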