I have one column in a CSV file. Each cell in the column holds multiple values in a list. For example, one cell contains ['A', 'B', 'C'] and another contains ['B', 'D']. I want to one-hot encode this column into binary values for use in machine learning.
How can I do that?
The input is a CSV file, so the cells hold strings, not lists. Strip the surrounding [] and use Series.str.get_dummies, then strip the quotes (') from the resulting column names:
df = df['col'].str.strip('[]').str.get_dummies(', ')
df.columns = df.columns.str.strip("'")
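As a quick check of the string-based approach, here is a minimal sketch on a made-up two-row CSV. The column name `col` and the sample values are assumptions taken from the question's example, not real data.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the question. The list cells are quoted
# so the commas inside them are not treated as field separators.
csv_text = """col
"['A', 'B', 'C']"
"['B', 'D']"
"""
raw = pd.read_csv(io.StringIO(csv_text))

# Each cell is a plain string such as "['A', 'B', 'C']".
# Drop the brackets, split on ', ' into indicator columns,
# then remove the quote characters left in the column names.
encoded = raw['col'].str.strip('[]').str.get_dummies(', ')
encoded.columns = encoded.columns.str.strip("'")

print(encoded)
```

On this sample the result should have the columns A, B, C, D, with the first row marking A, B and C and the second row marking B and D.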
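If the strings are first converted to real lists (for example with ast.literal_eval), use MultiLabelBinarizer for improved performance:
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# Parse each string such as "['A', 'B']" into a real Python list
lists = df['col'].apply(ast.literal_eval)
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(lists), columns=mlb.classes_, index=df.index)
print(df)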
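The list-based route can be sketched end to end as follows. This assumes every cell is a valid Python list literal, so that `ast.literal_eval` can parse it; the sample frame and the names `raw`, `as_lists` and `onehot` are made up for illustration.

```python
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: strings as they would arrive from pd.read_csv.
raw = pd.DataFrame({'col': ["['A', 'B', 'C']", "['B', 'D']"]})

# Parse each string into a real list. literal_eval only accepts
# Python literals, so no arbitrary code is executed.
as_lists = raw['col'].apply(ast.literal_eval)

# Fit the binarizer on the lists and wrap the 0/1 matrix in a DataFrame.
mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(mlb.fit_transform(as_lists),
                      columns=mlb.classes_,
                      index=raw.index)
print(onehot)
```

Passing `index=raw.index` keeps the encoded rows aligned with the original frame, which matters if the two are joined later.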