Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to one-hot-encode from a pandas column containing a list?

I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements i.e. one-hot-encode them (with value 1 representing a given element existing in a row and 0 in the case of absence).

For example, taking dataframe df

Col1   Col2         Col3  C      33     [Apple, Orange, Banana]  A      2.5    [Apple, Grape]  B      42     [Banana]  

I would like to convert this to:

df

Col1   Col2   Apple   Orange   Banana   Grape  C      33     1        1        1       0  A      2.5    1        0        0       1  B      42     0        0        1       0 

How can I use pandas/sklearn to achieve this?

like image 363
Melsauce Avatar asked Jul 25 '17 19:07

Melsauce


People also ask

How do I pull data from a column in Pandas?

You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.

How do you slice a DataFrame by column names?

To slice the columns, the syntax is df. loc[:,start:stop:step] ; where start is the name of the first column to take, stop is the name of the last column to take, and step as the number of indices to advance after each extraction; for example, you can select alternate columns.

How do I get a list of values from a DataFrame column?

values. tolist() you can convert pandas DataFrame Column to List. df['Courses'] returns the DataFrame column as a Series and then use values. tolist() to convert the column values to list.


1 Answers

We can also use sklearn.preprocessing.MultiLabelBinarizer:

Often we want to use sparse DataFrame for the real world data in order to save a lot of RAM.

Sparse solution (for Pandas v0.25.0+)

from sklearn.preprocessing import MultiLabelBinarizer  mlb = MultiLabelBinarizer(sparse_output=True)  df = df.join(             pd.DataFrame.sparse.from_spmatrix(                 mlb.fit_transform(df.pop('Col3')),                 index=df.index,                 columns=mlb.classes_)) 

result:

In [38]: df Out[38]:   Col1  Col2  Apple  Banana  Grape  Orange 0    C  33.0      1       1      0       1 1    A   2.5      1       0      1       0 2    B  42.0      0       1      0       0  In [39]: df.dtypes Out[39]: Col1                object Col2               float64 Apple     Sparse[int32, 0] Banana    Sparse[int32, 0] Grape     Sparse[int32, 0] Orange    Sparse[int32, 0] dtype: object  In [40]: df.memory_usage() Out[40]: Index     128 Col1       24 Col2       24 Apple      16    #  <--- NOTE! Banana     16    #  <--- NOTE! Grape       8    #  <--- NOTE! Orange      8    #  <--- NOTE! dtype: int64 

Dense solution

mlb = MultiLabelBinarizer() df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),                           columns=mlb.classes_,                           index=df.index)) 

Result:

In [77]: df Out[77]:   Col1  Col2  Apple  Banana  Grape  Orange 0    C  33.0      1       1      0       1 1    A   2.5      1       0      1       0 2    B  42.0      0       1      0       0 

like image 200
MaxU - stop WAR against UA Avatar answered Oct 07 '22 17:10

MaxU - stop WAR against UA