In the xarray documentation for the function apply_ufunc it says:
dask: ‘forbidden’, ‘allowed’ or ‘parallelized’, optional
How to handle applying to objects containing lazy data in the form of dask arrays:
‘forbidden’ (default): raise an error if a dask array is encountered.
‘allowed’: pass dask arrays directly on to func.
‘parallelized’: automatically parallelize func if any of the inputs are a dask array.
If used, the output_dtypes argument must also be provided.
Multiple output arguments are not yet supported.
and in the documentation's page on Parallel Computing there is a note:
For the majority of NumPy functions that are already wrapped by dask, it’s usually a better idea to use the pre-existing dask.array function, by using either pre-existing xarray methods or apply_ufunc() with dask='allowed'. Dask can often have a more efficient implementation that makes use of the specialized structure of a problem, unlike the generic speedups offered by dask='parallelized'.
However, I'm still not clear on what the difference between these two options is. Does allowed still operate on chunks one by one to lower memory usage? Will allowed still parallelize if the applied ufunc only uses dask operations? Why does parallelized require you to give more information about the ufunc outputs (i.e. the arguments output_dtypes, output_sizes) than allowed does?
dask='allowed' means that you're applying a function that already knows how to handle dask arrays, e.g., a function written in terms of dask.array operations. In most cases, that does indeed mean that the function will operate on chunks one by one to lower memory usage, and will apply the computation in parallel.
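As a minimal sketch (not part of the original answer), the example below passes a hypothetical dask-aware function, dask_aware_mean, to apply_ufunc with dask='allowed'. Because the function is written against the array API that dask arrays support, xarray can hand it the underlying dask array directly and the result stays lazy:

```python
import numpy as np
import xarray as xr

def dask_aware_mean(arr):
    # arr may be a dask array; .mean() on it stays lazy and chunk-wise.
    return arr.mean(axis=-1)

# Requires dask to be installed so .chunk() produces dask-backed data.
da = xr.DataArray(np.random.rand(4, 6), dims=("x", "time")).chunk({"time": 3})

result = xr.apply_ufunc(
    dask_aware_mean,
    da,
    input_core_dims=[["time"]],  # reduce over the "time" dimension
    dask="allowed",              # pass the dask array straight to the function
)
print(result)  # still lazy; call result.compute() to evaluate
```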
dask='parallelized' requires more information from the user because it creates its own wrapper that allows the provided function to act on dask arrays, by using low-level dask.array functions like atop. With dask='parallelized', you can provide a function that only knows how to handle NumPy arrays, and xarray.apply_ufunc will extend it to handle dask arrays, too.
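As a rough illustration (again not from the original answer), the sketch below wraps a hypothetical NumPy-only function, numpy_only_detrend, with dask='parallelized'. Because xarray never inspects the function itself, it needs output_dtypes to build the dask graph up front; the core dimension ("time") must not be chunked:

```python
import numpy as np
import xarray as xr

def numpy_only_detrend(arr):
    # Works on a plain NumPy block; knows nothing about dask.
    return arr - arr.mean(axis=-1, keepdims=True)

# Chunk along "x" only, so the core dimension "time" stays in one chunk.
da = xr.DataArray(np.random.rand(4, 6), dims=("x", "time")).chunk({"x": 2})

result = xr.apply_ufunc(
    numpy_only_detrend,
    da,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    dask="parallelized",        # xarray builds the dask wrapper itself
    output_dtypes=[da.dtype],   # dtype of each output, needed up front
)
print(result)  # lazy dask-backed DataArray; .compute() applies func per block
```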