
How to elegantly one hot encode a series of lists in pandas [duplicate]

So I have the following data:

>>> test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])
>>> test

0    [a, b, e]
1       [c, a]
2          [d]
3          [d]
4          [e]

I am trying to one-hot encode the values in these lists into a DataFrame that looks like this:

>>> pd.DataFrame([[1, 1, 0, 0, 1], [1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0], [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]],
             columns = ['a', 'b', 'c', 'd', 'e'])

    a   b   c   d   e
0   1   1   0   0   1
1   1   0   1   0   0
2   0   0   0   1   0
3   0   0   0   1   0
4   0   0   0   0   1

I have tried researching and I've found similar problems but none like this. I have attempted:

test.apply(pd.Series)

But that doesn't accomplish the one-hot aspect. It simply unpacks each list into positional columns (0, 1, 2, ...) instead of producing one column per distinct value. I'm sure I could figure out a lengthy solution, but I'd be glad to hear if there's a more elegant way to do this.

Thanks!

EDIT: I am aware that I can iterate through my test series, then create a column for each unique value found, then go back and iterate through test again, flagging said columns for unique values. But that doesn't seem very pandorable to me and I'm sure there's a more elegant way to do this.
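For reference, the two-pass approach described in the EDIT could be sketched roughly as follows. This is only an illustration of that manual idea; the names `labels` and `manual` are made up for this sketch, and sorting the labels is an assumption made to get a stable column order.

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# Pass 1: collect every distinct value that appears in any list.
# Sorting is an assumption here, used only to fix the column order.
labels = sorted({label for row in test for label in row})

# Pass 2: for each row, flag which of those values it contains.
manual = pd.DataFrame(
    [[int(label in row) for label in labels] for row in test],
    columns=labels,
    index=test.index,
)
```

It works, but it walks the data twice in Python-level loops, which is the part that feels un-pandorable.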

Brian asked Sep 05 '18 15:09




1 Answer

MultiLabelBinarizer from the sklearn library is more efficient for these problems and should be preferred over using apply with pd.Series. Here's a demo:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)

Result

   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
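
If adding an sklearn dependency is not an option, a pandas-only route may also work. The sketch below is not part of the demo above; it assumes pandas 0.25 or later (for `Series.explode`), and the names `exploded` and `alt` are made up for illustration.

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# One row per list element; the original index label is repeated.
exploded = test.explode()

# One-hot encode the flat values, then collapse back to one row per
# original index. max() keeps each flag at 1 even if a value is
# repeated within a single list.
alt = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
```

For the sample data this should give the same table as the MultiLabelBinarizer version. It makes no claim about relative speed; timing both on your own data is the safer way to choose.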
jpp answered Oct 20 '22 23:10