
Can I create a dask array with a delayed shape?

Tags:

dask

Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value?

My algorithm won't give me the shape of the array until pretty late in the computation.

Eventually, I will create some blocks whose shapes are specified by intermediate results of my computation, and then call da.concatenate on all of them (or da.block, if it were more flexible).

I don't think it is too detrimental if I can't, but it would be cool if I could.

Sample code

from dask import delayed
from dask import array as da
import numpy as np

n_shape = (3, 3)
# the shape wrapped as a single delayed value
shape = delayed(n_shape, nout=2)
# the shape as a tuple of delayed values, one per dimension
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
# np.float was removed in NumPy 1.24; the builtin float is equivalent
n = delayed(np.zeros)(n_shape, dtype=float)


# this doesn't work
# da.from_delayed(n, shape=shape, dtype=float)
# this doesn't work either, but I think goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=float)
hmaarrfk asked Jul 02 '18 22:07




1 Answer

You cannot provide a delayed shape, but you can mark a dimension as unknown by passing np.nan for it.

Example

import random
import numpy as np
import dask
import dask.array as da

@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20)))  # a 5 x ? array

values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)

>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>

>>> x.shape
(5, nan)

>>> x.compute().shape  # the second dimension is random, so it varies from run to run
(5, 88)
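If the shape really has to be known up front, one possible workaround is to compute only the delayed shape and pass the concrete tuple to da.from_delayed, leaving the block itself lazy. The sketch below assumes the shape is cheap to obtain on its own; find_shape and make_block are hypothetical stand-ins for the question's late-arriving shape computation, not part of dask.

```python
import numpy as np
import dask
import dask.array as da


@dask.delayed
def find_shape():
    # Hypothetical stand-in for the intermediate result that yields the shape
    return (3, 3)


@dask.delayed
def make_block(shape):
    # Hypothetical stand-in for the function that builds one block
    return np.zeros(shape, dtype=float)


d_shape = find_shape()
block = make_block(d_shape)

# Materialise only the small shape tuple; the block stays lazy
(shape,) = dask.compute(d_shape)

arr = da.from_delayed(block, shape=shape, dtype=float)
print(arr.shape)
```

Note that find_shape runs once here and again when arr is computed, unless the intermediate result is persisted, so this only pays off when the shape computation is inexpensive.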

Docs

See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks
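As a rough illustration of what that page covers: an array with unknown chunks can have its sizes filled in later, at the cost of computing the blocks. This sketch assumes a Dask release that provides Array.compute_chunk_sizes(); make_block and the fixed widths are made up for the example so that the result is deterministic.

```python
import numpy as np
import dask
import dask.array as da


@dask.delayed
def make_block(width):
    # Hypothetical block builder: a 5 x width array of ones
    return np.ones((5, width))


widths = [10, 12, 15]  # illustrative values, unknown to dask until computed
arrays = [
    da.from_delayed(make_block(w), shape=(5, np.nan), dtype=float)
    for w in widths
]
x = da.concatenate(arrays, axis=1)

# Until now the second dimension of x.shape is nan.
# compute_chunk_sizes() evaluates the blocks to learn their sizes
# and updates x in place.
x.compute_chunk_sizes()

print(x.shape)   # (5, 37)
print(x.chunks)  # ((5,), (10, 12, 15))
```

Once the chunk sizes are known, operations that need a concrete shape (slicing by position, reshaping, and so on) become available again.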

MRocklin answered Sep 24 '22 13:09
