So I have the following data:
>>> test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])
>>> test
0 [a, b, e]
1 [c, a]
2 [d]
3 [d]
4 [e]
I am trying to one-hot encode the values in these lists into columns of a DataFrame, so that the result looks like this:
>>> pd.DataFrame([[1, 1, 0, 0, 1], [1, 0, 1, 0, 0],
...               [0, 0, 0, 1, 0], [0, 0, 0, 1, 0],
...               [0, 0, 0, 0, 1]],
...              columns=['a', 'b', 'c', 'd', 'e'])
a b c d e
0 1 1 0 0 1
1 1 0 1 0 0
2 0 0 0 1 0
3 0 0 0 1 0
4 0 0 0 0 1
I have tried researching and I've found similar problems but none like this. I have attempted:
test.apply(pd.Series)
But that doesn't accomplish the one-hot aspect; it simply unpacks each list into positional columns. I'm sure I could figure out a lengthy solution, but I'd be glad to hear if there's a more elegant way to do this.
Thanks!
EDIT: I am aware that I can iterate through my test series, create a column for each unique value found, then go back and iterate through test again, flagging said columns for unique values. But that doesn't seem very pandorable to me and I'm sure there's a more elegant way to do this.
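For reference, a minimal sketch of that loop-based approach. The names labels and manual are only illustrative, and sorting the labels is an assumption made here to get a stable column order:

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# First pass: collect every unique value; each becomes a column.
labels = sorted({value for row in test for value in row})

# Second pass: start from all zeros and flag the values present in each row.
manual = pd.DataFrame(0, index=test.index, columns=labels)
for idx, row in test.items():
    manual.loc[idx, row] = 1

print(manual)
```

It works, but it needs two passes and a Python-level loop over every row.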
MultiLabelBinarizer from the sklearn library is more efficient for these problems and should be preferred over using apply with pd.Series. Here's a demo:
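import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# fit_transform learns the set of labels and returns a 0/1 indicator matrix;
# classes_ holds the labels in column order.
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)
print(res)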
Result
a b c d e
0 1 1 0 0 1
1 1 0 1 0 0
2 0 0 0 1 0
3 0 0 0 1 0
4 0 0 0 0 1
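If adding scikit-learn as a dependency is not an option, a pandas-only route may also work. This is a hedged sketch rather than part of the demo above: it assumes every label is a string that does not contain the '|' character, since Series.str.get_dummies splits on that separator by default. The name dummies is illustrative.

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# Join each list into a single '|'-delimited string, e.g. 'a|b|e',
# then let str.get_dummies split on '|' and build indicator columns.
dummies = test.str.join('|').str.get_dummies()

print(dummies)
```

For large inputs it is worth timing both approaches on your own data before choosing one.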