I have a folder of Parquet files that won't fit in memory, so I am using Dask to perform the data-cleansing operations. I have a function that performs item assignment, but I can't find any solution online that works for this particular case. Below is the function, which works in pandas. How do I get the same results with a Dask DataFrame? I thought delayed might help, but none of the solutions I've tried have worked.
import numpy as np

def item_assignment(df):
    # Keep only bits 1 and 2 of OtherCol
    new_col = np.bitwise_and(df['OtherCol'], 0b110)
    df['NewCol'] = 0
    df.loc[new_col == 0b010, 'NewCol'] = 1
    df.loc[new_col == 0b100, 'NewCol'] = -1
    return df
TypeError: '_LocIndexer' object does not support item assignment
You can replace your loc assignments with dask.dataframe.Series.mask:
df['NewCol'] = 0
df['NewCol'] = df['NewCol'].mask(new_col == 0b010, 1)
df['NewCol'] = df['NewCol'].mask(new_col == 0b100, -1)
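Put together, the whole function can be rewritten without loc. Series.mask has the same signature in pandas and Dask, so this sketch (using a hypothetical name, item_assignment_mask) runs unchanged on either:

```python
import numpy as np
import pandas as pd

def item_assignment_mask(df):
    # Keep only bits 1 and 2 of OtherCol
    new_col = np.bitwise_and(df['OtherCol'], 0b110)
    df['NewCol'] = 0
    # mask(cond, value) replaces entries where cond is True
    df['NewCol'] = df['NewCol'].mask(new_col == 0b010, 1)
    df['NewCol'] = df['NewCol'].mask(new_col == 0b100, -1)
    return df

df = item_assignment_mask(pd.DataFrame({"OtherCol": [0b010, 0b110, 0b100]}))
print(df['NewCol'].tolist())  # → [1, 0, -1]
```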
Alternatively, you can use map_partitions, which lets you run the raw pandas function unchanged:
ddf.map_partitions(item_assignment)
This applies the function to each of the individual pandas DataFrames that make up the Dask DataFrame:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"OtherCol": [0b010, 0b110, 0b100, 0b110, 0b100, 0b010]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(item_assignment).compute()
And we see the result as expected:
OtherCol NewCol
0 2 1
1 6 0
2 4 -1
3 6 0
4 4 -1
5 2 1
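Conceptually, map_partitions just applies the function to each pandas partition and concatenates the results. A pandas-only sketch of that behavior (simulating the two partitions by hand, and assuming the item_assignment function from the question):

```python
import numpy as np
import pandas as pd

def item_assignment(df):
    new_col = np.bitwise_and(df['OtherCol'], 0b110)
    df['NewCol'] = 0
    df.loc[new_col == 0b010, 'NewCol'] = 1
    df.loc[new_col == 0b100, 'NewCol'] = -1
    return df

df = pd.DataFrame({"OtherCol": [0b010, 0b110, 0b100, 0b110, 0b100, 0b010]})
# Simulate npartitions=2: split, apply per partition, concatenate
parts = [df.iloc[:3].copy(), df.iloc[3:].copy()]
result = pd.concat([item_assignment(p) for p in parts])
print(result['NewCol'].tolist())  # → [1, 0, -1, 0, -1, 1]
```

Because each partition is processed independently, item_assignment must not depend on rows outside its own partition, which is true here since every operation is row-wise.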