I'm in the process of migrating my pandas operations to dask. When I was using pandas, the following line worked successfully:
triggers = df.triggers.str.get_dummies(',')
It split each string at the commas and turned the resulting pieces into dummy variables.
For example, if df.triggers had three rows like this:
["a, b, c",
"a",
"b, c"]
this would output the values:
a | b | c
1 | 1 | 1
1 | 0 | 0
0 | 1 | 1
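For reference, here is a minimal runnable version of that pandas behavior (with the values written without spaces after the commas, for simplicity):

import pandas as pd

df = pd.DataFrame({'triggers': ['a,b,c', 'a', 'b,c']})
print(df.triggers.str.get_dummies(','))
#    a  b  c
# 0  1  1  1
# 1  1  0  0
# 2  0  1  1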
However, the same command fails in dask with the error AttributeError: get_dummies. When I try dd.get_dummies instead, it asks me to categorize the column first. But the individual strings I want to encode only exist after splitting on the commas, so there is nothing to categorize yet.
Any thoughts on how to get around this?
Dask is lazy in its computations, so it doesn't know all the unique values within a column until after computation. That is why dask requires columns to be categorized before one-hot encoding (i.e. dd.get_dummies).
The order of operations is therefore: str.split must execute first, then get_dummies can run on the new columns, and finally the newly encoded columns can be joined back onto the original dataframe.
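Before the full solution, here is a minimal sketch of the categorize requirement in isolation, using dask's .cat.as_known(), which computes the categories eagerly:

import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series(['a', 'b', 'a'], name='x'), npartitions=1)
# Calling dd.get_dummies(s) directly fails: the object-dtype series has
# unknown categories. Converting to a known categorical fixes that.
dummies = dd.get_dummies(s.astype('category').cat.as_known())
print(dummies.compute())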
This is how I solved the problem:
import pandas as pd
import dask.dataframe as dd

data = {
    'col_a': ['a,1,2,3', 'b,1,', 'c', 'd,1,2'],
    'other_columns': ['a', 'b', 'c', 'd'],
}
df = dd.from_dict(data, npartitions=2)  # I made sure to choose a partition count lower than my row count.

col_name_to_split = 'col_a'

def split_col(df: pd.DataFrame, col: str) -> pd.DataFrame:
    tmp_df = df[col].str.split(',', expand=True)
    df = df.drop(columns=[col])
    # Double underscore so these columns are easy to find when dropping later
    # and don't get mixed up with the encoded integer values.
    tmp_df.columns = [f'{col}__{x}' for x in tmp_df.columns]
    # Dask cannot know how many splits this function will produce (it only
    # sees the partition it is currently working on), so you must declare the
    # expected column count up front (similar to supplying `meta`).
    anticipated_splits = 4
    for col1 in [f'{col}__{num}' for num in range(anticipated_splits)]:
        # Fill with NaN if this partition happens to lack some of the split columns.
        tmp_df[col1] = tmp_df.get(col1, float('nan'))
    df = df.join(tmp_df)
    return df

df = df.map_partitions(split_col, col_name_to_split)

split_cols = [col for col in df.columns if f"{col_name_to_split}__" in col]
# This triggers the queued computations, so dask learns the categories,
# i.e. how many dummy columns to make.
df = df.categorize(split_cols)

for col in split_cols:
    tmp_df = dd.get_dummies(df[col], prefix=col_name_to_split)
    # Replace falsy values with NaN so that `.combine_first` doesn't
    # overwrite existing `True`s with incoming `False`s.
    # Caveat: the one-hot values render as `1.0` in some columns and `True`
    # in others. Dunno why, but they are treated as equivalent.
    tmp_df = tmp_df.map(lambda v: float('nan') if not v else v)
    df = df.combine_first(tmp_df)

df = df.drop(columns=split_cols)
Computing the result with df.compute() gives:
  col_a_ col_a_1 col_a_2 col_a_3 col_a_a col_a_b col_a_c col_a_d other_columns
0    NaN    True    True    True    True     NaN     NaN     NaN             a
1   True    True     NaN     NaN     NaN    True     NaN     NaN             b
2    NaN     NaN     NaN     NaN     NaN     NaN    True     NaN             c
3    NaN    True    True     NaN     NaN     NaN     NaN    True             d
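If the mixed True/1.0 representation is a problem downstream, one possible cleanup (a sketch, not part of the solution above) is to coerce every dummy column to a plain 0/1 encoding:

for c in [c for c in df.columns if c.startswith(f'{col_name_to_split}_')]:
    # NaN -> 0, then collapse the True/1.0 mix into uint8 0/1.
    df[c] = df[c].fillna(0).astype(bool).astype('uint8')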