
Does Dask support functions with multiple outputs in Custom Graphs?

Tags: python, dask

The Custom Graphs API of Dask seems to support only functions returning one output key/value.

For example, the following dependency could not be easily represented as a Dask graph:

    B -> D
   /      \
A-         -> F
   \      /
    C -> E

This can be worked around by storing a tuple under a "composite" key (e.g. "B_C" in this case) and then splitting it by getitem() or similar. However, that can lead to inefficient execution (e.g. unnecessary serialization) and reduce the clarity of DAG visualizations.
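For concreteness, the workaround can be sketched roughly as follows. The functions `split`, `inc`, and `add` and the input value are hypothetical stand-ins chosen only to mirror the diagram above; they are not part of Dask.

```python
from operator import getitem

import dask


def split(a):
    # Hypothetical function with two outputs, returned as one tuple.
    return a + 1, a * 2


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Custom graph: the tuple lives under the composite key "B_C",
# and two getitem tasks split it into the separate keys "B" and "C".
dsk = {
    "A": 10,
    "B_C": (split, "A"),
    "B": (getitem, "B_C", 0),
    "C": (getitem, "B_C", 1),
    "D": (inc, "B"),
    "E": (inc, "C"),
    "F": (add, "D", "E"),
}

# dask.get is the simple synchronous scheduler.
result = dask.get(dsk, "F")
```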

Is there a better way or is this currently not supported?

Petr Wolf, asked Jul 15 '16 22:07



1 Answer

Short answer

No, but it shouldn't matter.

Programming interface

You are right that the standard way to handle multiple outputs in Dask is getitem. With dask.delayed, you index the delayed result, as you suggest. Here is an example:

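from dask import delayed

@delayed(pure=True)
def minmax(a, b):
    # Return the pair ordered as (smaller, larger).
    if a > b:
        return b, a
    else:
        return a, b

result = minmax(1, 2)
# Indexing a Delayed adds one getitem task per output to the graph.
# The names lo and hi avoid shadowing the built-in min and max.
lo, hi = result[0], result[1]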
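As a further sketch, reasonably recent versions of Dask accept an `nout` argument to `delayed`, which lets the delayed result be unpacked directly instead of indexed by hand. Under the hood this still uses getitem, so it is a convenience rather than true multi-output support. The functions `split`, `inc`, and `add` below are hypothetical and only mirror the A to F graph from the question.

```python
from dask import delayed


@delayed(nout=2)
def split(a):
    # Hypothetical function with two outputs, returned as one tuple.
    return a + 1, a * 2


@delayed
def inc(x):
    return x + 1


@delayed
def add(x, y):
    return x + y


# nout=2 allows tuple unpacking; each name is its own Delayed object.
b, c = split(10)
f = add(inc(b), inc(c))

# Run on the synchronous scheduler to keep the example deterministic.
total = f.compute(scheduler="synchronous")
```

Calling `f.visualize()` on a graph built this way shows the getitem nodes explicitly, which may or may not suit the DAG clarity the question asks about.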

Performance

You raise an interesting question about performance. In practice, the distributed scheduler (which also works well on a single machine) should handle this situation without a performance penalty. The same is true for the single-machine threaded scheduler.
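One way to probe this on the threaded scheduler is to check whether the getitem task returns the same object that sits inside the tuple, rather than a copy. This is only a sketch under the assumption that threads share memory within one process; `make_pair` is a hypothetical producer, and the check says nothing about the distributed scheduler.

```python
import dask
from dask import delayed


@delayed
def make_pair():
    # Hypothetical producer returning two separate lists in one tuple.
    return [1, 2, 3], [4, 5, 6]


pair = make_pair()
first = pair[0]  # adds a getitem task that depends on the pair

# Compute both keys in a single pass so the producer runs only once.
pair_value, first_value = dask.compute(pair, first, scheduler="threads")

# Identity check: is the split-out element the same object as the
# element stored in the tuple, i.e. was nothing copied or serialized?
same_object = first_value is pair_value[0]
```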

MRocklin, answered Oct 22 '22 23:10