
What's the difference between dask=parallelized and dask=allowed in xarray's apply_ufunc?

The xarray documentation for apply_ufunc says:

dask: ‘forbidden’, ‘allowed’ or ‘parallelized’, optional

    How to handle applying to objects containing lazy data in the form of dask arrays:

    ‘forbidden’ (default): raise an error if a dask array is encountered.
    ‘allowed’: pass dask arrays directly on to func.
    ‘parallelized’: automatically parallelize func if any of the inputs are a dask array. 
                    If used, the output_dtypes argument must also be provided. 
                    Multiple output arguments are not yet supported.

The documentation's page on Parallel Computing also has this note:

For the majority of NumPy functions that are already wrapped by dask, it’s usually a better idea to use the pre-existing dask.array function, by using either pre-existing xarray methods or apply_ufunc() with dask='allowed'. Dask can often have a more efficient implementation that makes use of the specialized structure of a problem, unlike the generic speedups offered by dask='parallelized'.

However, I'm still not clear on the difference between these two options. Does dask='allowed' still operate on chunks one by one to lower memory usage? Will it still parallelize if the applied ufunc only uses dask operations? Why does dask='parallelized' require more information about the ufunc outputs (i.e. the arguments output_dtypes and output_sizes) than dask='allowed' does?

asked Aug 07 '18 22:08 by ThomasNicholas


1 Answer

dask='allowed' means that you're applying a function that already knows how to handle dask arrays, e.g., a function written in terms of dask.array operations. In most cases, that does indeed mean that the function will operate on chunks one by one to lower memory usage, and will apply the computation in parallel.
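As a minimal sketch of the dask='allowed' case, the example below applies a function written only in terms of dask.array operations to two chunked DataArrays. The function name magnitude, the variable names, and the array sizes are made up for illustration and do not come from the xarray documentation.

```python
import dask.array as da
import numpy as np
import xarray as xr


def magnitude(a, b):
    # Written purely in terms of dask.array operations, so it accepts
    # dask arrays directly and returns a lazy dask array.
    return da.sqrt(a ** 2 + b ** 2)


# Two small chunked (dask-backed) DataArrays; sizes and chunking are arbitrary.
u = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("y", "x")).chunk({"x": 1})
v = xr.DataArray(np.ones((2, 3)), dims=("y", "x")).chunk({"x": 1})

# With dask="allowed", xarray hands the underlying dask arrays straight to
# `magnitude`; no output_dtypes argument is needed.
lazy = xr.apply_ufunc(magnitude, u, v, dask="allowed")

# The result is still lazy; compute() triggers the chunk-wise evaluation.
result = lazy.compute()
```

Here the chunking and parallelism come from dask.array itself, since xarray only passes the arrays through and re-wraps the output with labels.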

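dask='parallelized' requires more information from the user because xarray creates its own wrapper that allows the provided function to act on dask arrays, using low-level dask.array functions like atop (renamed blockwise in later dask releases). That wrapper builds the lazy output array without running the function, so the output dtypes (and the sizes of any new dimensions) have to be declared up front. With dask='parallelized', you can provide a function that only knows how to handle NumPy arrays, and xarray.apply_ufunc will extend it to handle dask arrays too.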
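A rough sketch of the contrast, assuming a function that deliberately rejects anything other than a NumPy array. The name numpy_only_square and the isinstance check are hypothetical stand-ins for code that cannot handle dask arrays.

```python
import numpy as np
import xarray as xr


def numpy_only_square(a):
    # Hypothetical stand-in for code that only understands NumPy arrays.
    if not isinstance(a, np.ndarray):
        raise TypeError(f"expected numpy.ndarray, got {type(a).__name__}")
    return a ** 2


# A small chunked (dask-backed) DataArray; size and chunking are arbitrary.
x = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("y", "x")).chunk({"x": 1})

# dask="allowed" passes the dask array straight through, so this
# NumPy-only function rejects it.
try:
    xr.apply_ufunc(numpy_only_square, x, dask="allowed")
    allowed_failed = False
except TypeError:
    allowed_failed = True

# dask="parallelized" wraps the function so that it is called on the
# NumPy chunks; the output dtype is declared via output_dtypes.
lazy = xr.apply_ufunc(
    numpy_only_square, x, dask="parallelized", output_dtypes=[x.dtype]
)
result = lazy.compute()
```

The same function fails under dask='allowed' and works under dask='parallelized', at the cost of the generic block-by-block wrapping rather than a dask-aware implementation.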

answered Sep 28 '22 17:09 by shoyer