
sklearn LabelBinarizer returns vector when there are 2 classes

The following code:

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit_transform(['yes', 'no', 'no', 'yes'])

returns:

array([[1],
       [0],
       [0],
       [1]])

However, I would like one column per class:

array([[1, 0],
       [0, 1],
       [0, 1],
       [1, 0]])

(I need the data in this format so I can give it to a neural network that uses the softmax function at the output layer)

When there are more than 2 classes, LabelBinarizer behaves as desired:

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit_transform(['yes', 'no', 'no', 'yes', 'maybe'])

returns:

array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

Above, there is 1 column per class.

Is there any simple way to achieve the same (1 column per class) when there are 2 classes?

Edit: Based on yangjie's answer I wrote a class to wrap LabelBinarizer to produce the desired behavior described above: http://pastebin.com/UEL2dP62

import numpy as np
from sklearn.preprocessing import LabelBinarizer


class LabelBinarizer2:

    def __init__(self):
        self.lb = LabelBinarizer()

    def fit(self, X):
        # Convert X to array
        X = np.array(X)
        # Fit X using the LabelBinarizer object
        self.lb.fit(X)
        # Save the classes
        self.classes_ = self.lb.classes_
        # Return self so calls can be chained, as scikit-learn estimators do
        return self

    def fit_transform(self, X):
        # Convert X to array
        X = np.array(X)
        # Fit + transform X using the LabelBinarizer object
        Xlb = self.lb.fit_transform(X)
        # Save the classes
        self.classes_ = self.lb.classes_
        if len(self.classes_) == 2:
            Xlb = np.hstack((Xlb, 1 - Xlb))
        return Xlb

    def transform(self, X):
        # Convert X to array
        X = np.array(X)
        # Transform X using the LabelBinarizer object
        Xlb = self.lb.transform(X)
        if len(self.classes_) == 2:
            Xlb = np.hstack((Xlb, 1 - Xlb))
        return Xlb

    def inverse_transform(self, Xlb):
        # Convert Xlb to array
        Xlb = np.array(Xlb)
        if len(self.classes_) == 2:
            X = self.lb.inverse_transform(Xlb[:, 0])
        else:
            X = self.lb.inverse_transform(Xlb)
        return X

Edit 2: It turns out yangjie has also written a new version of LabelBinarizer, awesome!

asked Aug 11 '15 at 16:08 by applecider




1 Answer

I don't think there is a direct way to do this, especially if you also need inverse_transform.

But you can easily construct the two-column label array with NumPy:

In [18]: import numpy as np

In [19]: from sklearn.preprocessing import LabelBinarizer

In [20]: lb = LabelBinarizer()

In [21]: label = lb.fit_transform(['yes', 'no', 'no', 'yes'])

In [22]: label = np.hstack((label, 1 - label))

In [23]: label
Out[23]:
array([[1, 0],
       [0, 1],
       [0, 1],
       [1, 0]])

You can then use inverse_transform by passing it only the first column:

In [24]: lb.inverse_transform(label[:, 0])
Out[24]:
array(['yes', 'no', 'no', 'yes'],
      dtype='<U3')
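
One detail worth checking in the stacked array above: LabelBinarizer sorts its classes, so lb.classes_ is ['no', 'yes'] and the single column it returns marks lb.classes_[1] ('yes'). With np.hstack((label, 1 - label)), column 0 therefore stands for 'yes', which is the reverse of the classes_ order used in the multiclass case. The following is a minimal sketch of a variant that keeps column i aligned with lb.classes_[i]; the probs array is made-up data standing in for softmax outputs, not something from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# Shape (4, 1); a 1 marks lb.classes_[1], the second class in sorted order.
col = lb.fit_transform(['yes', 'no', 'no', 'yes'])

# Putting (1 - col) first makes column i correspond to lb.classes_[i],
# matching the column order LabelBinarizer uses for 3+ classes.
onehot = np.hstack((1 - col, col))

# Decoding by arg-max works the same way for any number of classes.
probs = np.array([[0.2, 0.8], [0.9, 0.1]])  # hypothetical network outputs
decoded = lb.classes_[probs.argmax(axis=1)]

print(lb.classes_)
print(onehot)
print(decoded)
```

With this ordering, inverse_transform would need the second column (onehot[:, 1]) rather than the first. Which ordering to use is a design choice; the original one reproduces the exact output requested in the question.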

Based on the above solution, you can write a class that inherits from LabelBinarizer and makes the operations and results consistent for both the binary and the multiclass case.

from sklearn.preprocessing import LabelBinarizer
import numpy as np

class MyLabelBinarizer(LabelBinarizer):
    def transform(self, y):
        Y = super().transform(y)
        if self.y_type_ == 'binary':
            return np.hstack((Y, 1-Y))
        else:
            return Y

    def inverse_transform(self, Y, threshold=None):
        if self.y_type_ == 'binary':
            return super().inverse_transform(Y[:, 0], threshold)
        else:
            return super().inverse_transform(Y, threshold)

Then

lb = MyLabelBinarizer()
label1 = lb.fit_transform(['yes', 'no', 'no', 'yes'])
print(label1)
print(lb.inverse_transform(label1))
label2 = lb.fit_transform(['yes', 'no', 'no', 'yes', 'maybe'])
print(label2)
print(lb.inverse_transform(label2))

gives

[[1 0]
 [0 1]
 [0 1]
 [1 0]]
['yes' 'no' 'no' 'yes']
[[0 0 1]
 [0 1 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]]
['yes' 'no' 'no' 'yes' 'maybe']
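
As an alternative that is not part of the answer above, scikit-learn's OneHotEncoder produces one column per category even when there are only two, as long as its drop parameter is left at the default. The sketch below assumes the labels can be reshaped into a single 2-D column, since OneHotEncoder is designed for feature matrices rather than 1-D label vectors. It returns a sparse matrix by default, so .toarray() is used here instead of a version-specific dense-output flag.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
labels = np.array(['yes', 'no', 'no', 'yes']).reshape(-1, 1)

enc = OneHotEncoder()
onehot = enc.fit_transform(labels).toarray()  # dense float array, shape (4, 2)

# Columns follow enc.categories_[0], which is sorted: ['no', 'yes'].
print(enc.categories_[0])
print(onehot)

# inverse_transform returns a 2-D array; ravel() flattens it back to labels.
restored = enc.inverse_transform(onehot).ravel()
print(restored)
```

Here the columns follow the sorted category order, so 'yes' is encoded as [0, 1] rather than the [1, 0] shown in the question.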
answered Sep 19 '22 at 12:09 by yangjie