add a dask.array column to a dask.dataframe

I have a dask dataframe and a dask array with the same number of rows, in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array's columns to the dataframe, and have tried several approaches, each of which failed in its own way.

df['col'] = da.col
# TypeError: Column assignment doesn't support type Array

df['col'] = da.to_frame(columns='col')
# TypeError: '<' not supported between instances of 'str' and 'int'

df['col'] = da.to_frame(columns=['col']).set_index(df.col).col
# TypeError: '<' not supported between instances of 'str' and 'int'

df = df.reset_index()
df['col'] = da.to_frame(columns='col')
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

and a few other variants.

What is the right way to add a dask array column to a dask dataframe when the structures are logically compatible?

Daniel Mahler asked Jan 08 '18

1 Answer

This does seem to work as of dask version 2021.4.0, and possibly earlier. Just make sure the array's chunks line up with the dataframe's partitions: the same number of blocks, with the same sizes.

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Two partitions of two rows each...
ddf = dd.from_pandas(pd.DataFrame({'z': np.arange(100, 104)}),
                     npartitions=2)
# ...matched by two chunks of two elements each.
ddf['a'] = da.arange(200, 204, chunks=2)
print(ddf.compute())

Output:

     z    a
0  100  200
1  101  201
2  102  202
3  103  203
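
The original question involves a dataframe indexed by strings, where the array's chunks may not line up with the partitions out of the box. Below is a minimal sketch of one way to handle that, under the same dask version assumption: compute each partition's length with map_partitions(len), rechunk the array to match, then assign. The variable names (pdf, lengths, arr) are illustrative, not from the original post.

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

# A dataframe indexed by strings, as in the question.
pdf = pd.DataFrame({'z': np.arange(100, 104)},
                   index=['p', 'q', 'r', 's'])
ddf = dd.from_pandas(pdf, npartitions=2)

# Compute each partition's length, then rechunk the array so its
# blocks line up with those partitions before assigning.
lengths = tuple(ddf.map_partitions(len).compute())
arr = da.arange(200, 204, chunks=4).rechunk((lengths,))

ddf['a'] = arr
print(ddf.compute())

If the chunks already match the partitions, the rechunk step is unnecessary; the assignment pairs array blocks with dataframe partitions positionally rather than aligning on index values. dask.dataframe.from_dask_array also accepts an index argument, which may be another route when the partitioning already agrees.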
HoosierDaddy answered Sep 18 '22