Is there an easy way to convert a Dask DataFrame of numeric values into a Dask Array, similar to .values on a pandas DataFrame? I can't seem to find anything for this in the provided API, but I'd assume it's a common operation.
You can use the .values property:

x = df.values
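For example, a minimal sketch (the small DataFrame here is purely illustrative):

import pandas as pd
import dask.dataframe as dd

# Illustrative data: two numeric columns split across two partitions
pdf = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5.0, 6.0, 7.0, 8.0]})
df = dd.from_pandas(pdf, npartitions=2)

x = df.values        # a lazy dask array
print(x.compute())   # materializes the underlying NumPy array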
At the moment there is no trivial way to do this, because a dask.array needs to know the lengths of all of its chunks and a dask.dataframe doesn't know those lengths. This cannot be a completely lazy operation.
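To see the missing lengths concretely, here is a sketch (again with an illustrative 2-partition DataFrame): the per-partition row counts show up as nan in the resulting array's chunks.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}),
    npartitions=2,
)

x = df.values
print(x.chunks)  # ((nan, nan), (2,)): row counts per chunk are unknown,
                 # so operations that need them (e.g. row slicing) may fail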
That being said, you can accomplish it using dask.delayed as follows:
import dask.array as da
from dask import compute

def to_dask_array(df):
    # One delayed object per DataFrame partition
    partitions = df.to_delayed()

    # Lazily collect the shape of each partition's values array,
    # plus the dtype from the first partition
    shapes = [part.values.shape for part in partitions]
    dtype = partitions[0].values.dtype

    # Trigger computation once to resolve the dtype and all shapes
    results = compute(dtype, *shapes)
    dtype, shapes = results[0], results[1:]

    # Wrap each partition as a dask array chunk of known shape and dtype
    chunks = [da.from_delayed(part.values, shape, dtype)
              for part, shape in zip(partitions, shapes)]
    return da.concatenate(chunks, axis=0)
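A hypothetical usage, showing that the resulting array now has concrete chunk sizes (the DataFrame is illustrative):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": range(6), "b": range(6)})
df = dd.from_pandas(pdf, npartitions=2)

arr = to_dask_array(df)
print(arr.chunks)     # e.g. ((3, 3), (2,)): chunk sizes are now known
print(arr.compute())  # a (6, 2) NumPy array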