I have a dask dataframe and a dask array with the same number of rows, in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array's columns to the dataframe, and have tried several approaches, each of which failed in its own way:
df['col'] = da.col
# TypeError: Column assignment doesn't support type Array
df['col'] = da.to_frame(columns='col')
# TypeError: '<' not supported between instances of 'str' and 'int'
df['col'] = da.to_frame(columns=['col']).set_index(df.col).col
# TypeError: '<' not supported between instances of 'str' and 'int'
df = df.reset_index()
df['col'] = da.to_frame(columns='col')
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
and a few other variants.
What is the right way to add a dask array column to a dask dataframe when the structures are logically compatible?
This does seem to work as of dask version 2021.4.0, and possibly earlier. Just make sure the number of dataframe partitions matches the number of array chunks.
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Two partitions of two rows each
ddf = dd.from_pandas(pd.DataFrame({'z': np.arange(100, 104)}),
                     npartitions=2)
# Two chunks of two elements each, matching the partitions above
ddf['a'] = da.arange(200, 204, chunks=2)
print(ddf.compute())
Output:
z a
0 100 200
1 101 201
2 102 202
3 103 203
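If the array's chunks don't already line up with the dataframe's partitions (likely in the original string-indexed case, where divisions may be unknown), one option is to rechunk the array to the dataframe's partition lengths before assigning. This is a minimal sketch of that idea, assuming dask >= 2021.4.0 as above; the pdf, arr, and lengths names are just illustrative:

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

# A string-indexed dataframe, as in the question
pdf = pd.DataFrame({'z': np.arange(100, 104)}, index=list('abcd'))
ddf = dd.from_pandas(pdf, npartitions=2)

# An array whose chunks do NOT initially match the partitions
arr = da.arange(200, 204, chunks=3)

# Measure each partition's length, then rechunk the array so its
# blocks line up one-to-one with the dataframe's partitions
lengths = tuple(ddf.map_partitions(len).compute())
ddf['a'] = arr.rechunk(chunks=(lengths,))
print(ddf.compute())

The key point is that the assignment pairs the n-th array chunk with the n-th dataframe partition by position, so it does not depend on the index values themselves.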