I'm in the process of migrating my pandas operations to dask. When I was using pandas, the following line worked successfully:
triggers = df.triggers.str.get_dummies(',')
It split each string at the commas and turned the resulting pieces into dummy variables.
For example, if df.triggers had three rows like this:
["a, b, c",
"a",
"b, c"]
this would output the values:
a | b | c
1 | 1 | 1
1 | 0 | 0
0 | 1 | 1
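For reference, here is a minimal runnable version of that pandas behavior (with the values written without spaces after the commas, for simplicity):

import pandas as pd

df = pd.DataFrame({'triggers': ['a,b,c', 'a', 'b,c']})
print(df.triggers.str.get_dummies(','))
#    a  b  c
# 0  1  1  1
# 1  1  0  0
# 2  0  1  1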
However, the same command fails in dask with the error AttributeError: get_dummies. When I try dd.get_dummies instead, it asks me to categorize the column first. But the individual strings I want to encode only exist after splitting on the commas, so there is nothing to categorize yet.
Any thoughts on how to get around this?
Dask is lazy in its computations, so it doesn't know all the unique values within a column until after computation. That is why dask requires columns to be categorized before one-hot encoding (i.e. dd.get_dummies).
The order of operations is therefore: str.split must execute first, then get_dummies can run on the new columns, and finally the newly encoded columns can be joined back onto the original dataframe.
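Before the full solution, here is a minimal sketch of the categorize requirement in isolation, using dask's .cat.as_known(), which computes the categories eagerly:

import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series(['a', 'b', 'a'], name='x'), npartitions=1)
# Calling dd.get_dummies(s) directly fails: the object-dtype series has
# unknown categories. Converting to a known categorical fixes that.
dummies = dd.get_dummies(s.astype('category').cat.as_known())
print(dummies.compute())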
This is how I solved the problem:
import pandas as pd
import dask.dataframe as dd

data = {
    'col_a': ['a,1,2,3', 'b,1,', 'c', 'd,1,2'],
    'other_columns': ['a', 'b', 'c', 'd'],
}
df = dd.from_dict(data, npartitions=2)  # I made sure to choose a partition count lower than my row count.

col_name_to_split = 'col_a'

def split_col(df: pd.DataFrame, col: str) -> pd.DataFrame:
    tmp_df = df[col].str.split(',', expand=True)
    df = df.drop(columns=[col])
    # Double underscore so these columns are easy to find when dropping later
    # and don't get mixed up with the encoded integer values.
    tmp_df.columns = [f'{col}__{x}' for x in tmp_df.columns]
    # Dask cannot know how many splits this function will produce (it only
    # sees the partition it is currently working on), so you must declare the
    # expected column count up front (similar to supplying `meta`).
    anticipated_splits = 4
    for col1 in [f'{col}__{num}' for num in range(anticipated_splits)]:
        # Fill with NaN if this partition happens to lack some of the split columns.
        tmp_df[col1] = tmp_df.get(col1, float('nan'))
    df = df.join(tmp_df)
    return df

df = df.map_partitions(split_col, col_name_to_split)

split_cols = [col for col in df.columns if f"{col_name_to_split}__" in col]
# This triggers the queued computations, so dask learns the categories,
# i.e. how many dummy columns to make.
df = df.categorize(split_cols)

for col in split_cols:
    tmp_df = dd.get_dummies(df[col], prefix=col_name_to_split)
    # Replace falsy values with NaN so that `.combine_first` doesn't
    # overwrite existing `True`s with incoming `False`s.
    # Caveat: the one-hot values render as `1.0` in some columns and `True`
    # in others. Dunno why, but they are treated as equivalent.
    tmp_df = tmp_df.map(lambda v: float('nan') if not v else v)
    df = df.combine_first(tmp_df)

df = df.drop(columns=split_cols)
Computing the result with df.compute() gives:
  col_a_ col_a_1 col_a_2 col_a_3 col_a_a col_a_b col_a_c col_a_d other_columns
0    NaN    True    True    True    True     NaN     NaN     NaN             a
1   True    True     NaN     NaN     NaN    True     NaN     NaN             b
2    NaN     NaN     NaN     NaN     NaN     NaN    True     NaN             c
3    NaN    True    True     NaN     NaN     NaN     NaN    True             d
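If the mixed True/1.0 representation is a problem downstream, one possible cleanup (a sketch, not part of the solution above) is to coerce every dummy column to a plain 0/1 encoding:

for c in [c for c in df.columns if c.startswith(f'{col_name_to_split}_')]:
    # NaN -> 0, then collapse the True/1.0 mix into uint8 0/1.
    df[c] = df[c].fillna(0).astype(bool).astype('uint8')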