How to one-hot-encode from a pandas column containing a list?

I would like to break down a pandas column whose values are lists of elements into as many columns as there are unique elements, i.e. one-hot-encode them (with 1 meaning the element is present in a row and 0 meaning it is absent).

For example, take the dataframe df:

Col1   Col2   Col3
C      33     [Apple, Orange, Banana]
A      2.5    [Apple, Grape]
B      42     [Banana]

I would like to convert this to:

Col1   Col2   Apple   Orange   Banana   Grape
C      33     1       1        1        0
A      2.5    1       0        0        1
B      42     0       0        1        0

How can I use pandas/sklearn to achieve this?

363

asked Jul 25 '17 19:07

Melsauce

1 Answer

We can also use sklearn.preprocessing.MultiLabelBinarizer:

For real-world data, a sparse DataFrame is often preferable, because it can save a lot of RAM.

Sparse solution (for Pandas v0.25.0+)

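import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# sparse_output=True makes fit_transform return a SciPy sparse matrix
mlb = MultiLabelBinarizer(sparse_output=True)

# pop() removes Col3 from df; the encoded columns are joined back on the index
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))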

Result:

In [38]: df
Out[38]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

In [39]: df.dtypes
Out[39]:
Col1                object
Col2               float64
Apple     Sparse[int32, 0]
Banana    Sparse[int32, 0]
Grape     Sparse[int32, 0]
Orange    Sparse[int32, 0]
dtype: object

In [40]: df.memory_usage()
Out[40]:
Index     128
Col1       24
Col2       24
Apple      16    #  <--- NOTE!
Banana     16    #  <--- NOTE!
Grape       8    #  <--- NOTE!
Orange      8    #  <--- NOTE!
dtype: int64
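To give a rough feel for the RAM saving on larger data, here is a minimal sketch that encodes the same synthetic rows both ways and compares the total memory usage. The data (1,000 rows, each holding 2 of 50 made-up labels) and all variable names are assumptions for illustration only; actual savings depend on how sparse your data is.

```python
import random

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical synthetic data: 1,000 rows, each with 2 of 50 possible labels.
rng = random.Random(0)
labels = [f"label_{i}" for i in range(50)]
rows = [rng.sample(labels, 2) for _ in range(1000)]

# Sparse encoding: fit_transform returns a SciPy sparse matrix.
sparse_mlb = MultiLabelBinarizer(sparse_output=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    sparse_mlb.fit_transform(rows),
    columns=sparse_mlb.classes_)

# Dense encoding of the same rows, for comparison.
dense_mlb = MultiLabelBinarizer()
dense_df = pd.DataFrame(
    dense_mlb.fit_transform(rows),
    columns=dense_mlb.classes_)

sparse_bytes = int(sparse_df.memory_usage().sum())
dense_bytes = int(dense_df.memory_usage().sum())
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

Both frames hold the same 0/1 values; only the storage differs.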

Dense solution

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          columns=mlb.classes_,
                          index=df.index))

Result:

In [77]: df
Out[77]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
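For a self-contained run, the dense approach can be wrapped in a small function and applied to the dataframe from the question. This is only a sketch: the helper name one_hot_list_column is hypothetical, and unlike the snippet above it uses drop() rather than pop(), so the input frame is left unmodified.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def one_hot_list_column(frame, column):
    """Hypothetical helper: replace a list-valued column with 0/1 indicator columns."""
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(frame[column]),
        columns=mlb.classes_,   # unique elements, in sorted order
        index=frame.index)      # keep the original index so join() aligns rows
    # drop() returns a copy, so the caller's frame keeps its list column.
    return frame.drop(columns=column).join(encoded)


# The example dataframe from the question.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

result = one_hot_list_column(df, "Col3")
print(result)
```

Note that the new columns come out in sorted order (Apple, Banana, Grape, Orange), not in order of first appearance as in the question's desired output; reorder them afterwards if that matters.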

200

answered Oct 07 '22 17:10

MaxU - stop WAR against UA

Tags:

python

pandas

numpy

scikit-learn

sklearn-pandas