I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements, i.e. one-hot-encode them (with value 1 representing a given element existing in a row and 0 in the case of absence).
For example, taking dataframe df
    Col1  Col2  Col3
    C     33    [Apple, Orange, Banana]
    A     2.5   [Apple, Grape]
    B     42    [Banana]
I would like to convert this to:
    df

    Col1  Col2  Apple  Orange  Banana  Grape
    C     33    1      1       1       0
    A     2.5   1      0       0       1
    B     42    0      0       1       0
How can I use pandas/sklearn to achieve this?
You can use the loc and iloc indexers to access columns in a pandas DataFrame. For example, to retrieve a column named Grades, pass its name to loc: df.loc[:, 'Grades'].
To slice columns, the syntax is df.loc[:, start:stop:step], where start is the name of the first column to take, stop is the name of the last column to take (inclusive), and step is the number of positions to advance after each extraction; for example, a step of 2 selects alternate columns.
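The column access and slicing described above can be sketched as follows. The DataFrame and its column names (Name, Grades, Courses, Fee) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Grades": [90, 85, 77],
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
})

# Select a single column by label; the result is a Series.
grades = df.loc[:, "Grades"]

# Label-based slice with a step of 2. With loc, both endpoints are inclusive.
alternate = df.loc[:, "Name":"Courses":2]

# Position-based equivalent with iloc; here the stop position is exclusive.
by_position = df.iloc[:, 0:3:2]

print(grades.tolist())
print(list(alternate.columns))
print(list(by_position.columns))
```

Note the difference in endpoint handling: loc includes the stop label, while iloc excludes the stop position.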
To convert a pandas DataFrame column to a list, use values.tolist(): df['Courses'] returns the column as a Series, and df['Courses'].values.tolist() converts its values to a list.
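A minimal sketch of that conversion, assuming a small made-up DataFrame with a Courses column (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data; only the Courses column matters here.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
})

# Selecting one column gives a Series.
courses = df["Courses"]

# Convert the underlying values to a plain Python list.
as_list = courses.values.tolist()

# Series.tolist() gives the same result without going through .values.
same_list = courses.tolist()

print(as_list)
```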
We can also use sklearn.preprocessing.MultiLabelBinarizer:
For real-world data we often want a sparse DataFrame in order to save a lot of RAM.
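    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    # sparse_output=True makes fit_transform return a SciPy sparse matrix
    mlb = MultiLabelBinarizer(sparse_output=True)

    # pop() removes Col3 from df and returns it; the encoded columns are joined back on the index
    df = df.join(
        pd.DataFrame.sparse.from_spmatrix(
            mlb.fit_transform(df.pop('Col3')),
            index=df.index,
            columns=mlb.classes_))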
Result:

    In [38]: df
    Out[38]:
      Col1  Col2  Apple  Banana  Grape  Orange
    0    C  33.0      1       1      0       1
    1    A   2.5      1       0      1       0
    2    B  42.0      0       1      0       0

    In [39]: df.dtypes
    Out[39]:
    Col1                object
    Col2               float64
    Apple     Sparse[int32, 0]
    Banana    Sparse[int32, 0]
    Grape     Sparse[int32, 0]
    Orange    Sparse[int32, 0]
    dtype: object

    In [40]: df.memory_usage()
    Out[40]:
    Index     128
    Col1       24
    Col2       24
    Apple      16    #  <--- NOTE!
    Banana     16    #  <--- NOTE!
    Grape       8    #  <--- NOTE!
    Orange      8    #  <--- NOTE!
    dtype: int64
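For a dense (non-sparse) result, leave out sparse_output:

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer()

    # fit_transform returns a dense 0/1 array; mlb.classes_ holds the sorted labels
    df = df.join(
        pd.DataFrame(
            mlb.fit_transform(df.pop('Col3')),
            columns=mlb.classes_,
            index=df.index))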
Result:
    In [77]: df
    Out[77]:
      Col1  Col2  Apple  Banana  Grape  Orange
    0    C  33.0      1       1      0       1
    1    A   2.5      1       0      1       0
    2    B  42.0      0       1      0       0
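If pulling in sklearn is not an option, a pandas-only alternative may also work. This is a different technique from the MultiLabelBinarizer answer above: it joins each list into a delimited string and lets Series.str.get_dummies split it back into indicator columns. It is a sketch that assumes no label contains the "|" character and that every cell in Col3 is a list of strings.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# pop() removes Col3 from df and returns it.
# str.join turns each list into "Apple|Orange|Banana"-style text,
# and str.get_dummies builds one 0/1 column per distinct label.
# Assumption: "|" never occurs inside a label.
encoded = df.pop("Col3").str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the remaining columns by index.
result = df.join(encoded)

print(result)
```

As with mlb.classes_, the new columns come out in sorted label order (Apple, Banana, Grape, Orange) rather than in order of first appearance.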