
Dask Array from DataFrame

Tags: dask

Is there a way to easily convert a Dask DataFrame of numeric values into a Dask Array, similar to .values on a pandas DataFrame? I can't find any way to do this with the provided API, but I'd assume it's a common operation.

Asked by Paul English on May 25 '16 at 18:05




1 Answer

Edit: yes, now this is trivial

You can use the .values property of the Dask DataFrame:

x = df.values

Older, now incorrect answer

At the moment there is no trivial way to do this. dask.array needs to know the length of each of its chunks, and dask.dataframe doesn't know the length of its partitions, so the conversion cannot be a completely lazy operation.

That being said, you can accomplish it using dask.delayed as follows:

import dask.array as da
from dask import compute

def to_dask_array(df):
    # One delayed pandas DataFrame per partition
    partitions = df.to_delayed()
    shapes = [part.values.shape for part in partitions]
    # A DataFrame has no .dtype attribute, so take it from the NumPy values
    dtype = partitions[0].values.dtype

    results = compute(dtype, *shapes)  # trigger computation to find dtype and shapes
    dtype, shapes = results[0], results[1:]

    # Wrap each partition's values as an array block with a known shape
    chunks = [da.from_delayed(part.values, shape, dtype)
              for part, shape in zip(partitions, shapes)]
    return da.concatenate(chunks, axis=0)

Answered by MRocklin on Oct 06 '22 at 18:10
