I am trying to use MultiLabelBinarizer in sklearn. I have a pandas Series that I want to pass to MultiLabelBinarizer's fit function. However, fit expects its input to be an iterable of iterables, and I am not sure how to convert a pandas Series to the required type.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
data = pd.read_csv("somecsvFile")
y = pd.DataFrame(data['class'])
mlb = MultiLabelBinarizer()
y = mlb.fit(???)
I tried converting it to a NumPy array and using pandas' iter function, but nothing seems to work.
Please suggest a way to do this.
Thanks
Edit1: Output of print(data['class'].head(10)) is:
0 func
1 func
2 func
3 non func
4 func
5 func
6 non func
7 non func
8 non func
9 func
Name: status_group, dtype: object
How to work around the fact that MultiLabelBinarizer's fit needs an iterable of iterables as input:
In [8]: df
Out[8]:
class
0 func
1 func
2 func
3 non func
4 func
5 func
6 non func
7 non func
8 non func
9 func
In [10]: import pandas as pd
...: from sklearn.preprocessing import MultiLabelBinarizer
In [11]: y = df['class'].str.split(expand=False) # <--- NOTE: each string becomes a list of whitespace-separated tokens
In [12]: mlb = MultiLabelBinarizer()
...: y = mlb.fit_transform(y)
...:
In [13]: y
Out[13]:
array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 1],
       [1, 0],
       [1, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 0]])
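Note that splitting on whitespace turns "non func" into the two labels "non" and "func", which is why those rows come out as [1, 1]. The sketch below reproduces the session in a self-contained way and, as an alternative, wraps each string in a one-element list so that "non func" stays a single label. The hand-built df and the variable names are assumptions for illustration; the original reads the column from a CSV file.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hand-built stand-in for the 'class' column shown above (assumption).
df = pd.DataFrame({"class": ["func", "func", "func", "non func", "func",
                             "func", "non func", "non func", "non func", "func"]})

# Approach from the session above: split on whitespace, so
# "non func" becomes the two labels ["non", "func"].
tokens = df["class"].str.split(expand=False)
mlb = MultiLabelBinarizer()
y_split = mlb.fit_transform(tokens)
split_classes = list(mlb.classes_)  # column order of y_split

# Alternative: wrap each string in a one-element list so that
# every row carries exactly one label and "non func" is kept whole.
wrapped = df["class"].apply(lambda label: [label])
mlb_whole = MultiLabelBinarizer()
y_whole = mlb_whole.fit_transform(wrapped)
whole_classes = list(mlb_whole.classes_)  # column order of y_whole
```

Which variant is appropriate depends on whether "non func" is meant as one class or as two separate tags; mlb.classes_ shows which label each output column corresponds to.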
UPDATE: as proposed by @unutbu, you can use pd.get_dummies(), which treats each distinct string (including "non func") as a single class:
In [21]: pd.get_dummies(df['class'])
Out[21]:
   func  non func
0     1         0
1     1         0
2     1         0
3     0         1
4     1         0
5     1         0
6     0         1
7     0         1
8     0         1
9     1         0
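If a NumPy array is needed rather than a DataFrame, the get_dummies() result can be converted as in this minimal sketch. The hand-built df is an assumption standing in for the CSV data, and dtype=int is passed explicitly because, depending on the pandas version, the default output may be boolean rather than 0/1 integers.

```python
import pandas as pd

# Hand-built stand-in for the 'class' column shown above (assumption).
df = pd.DataFrame({"class": ["func", "func", "func", "non func", "func",
                             "func", "non func", "non func", "non func", "func"]})

# One indicator column per distinct string; dtype=int forces 0/1 integers.
dummies = pd.get_dummies(df["class"], dtype=int)

# Column labels give the class order; to_numpy() yields a plain 2-D array.
class_names = list(dummies.columns)
y = dummies.to_numpy()
```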