Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scikit-learn: How to compose LabelEncoder and OneHotEncoder with a pipeline?

While preprocessing the labels for a machine learning classifying task, I need to one hot encode the labels which take string values. It happens that OneHotEncoder from sklearn.preprocessing or to_categorical from kera.np_utils require int inputs. This means that I need to precede the one hot encoder with a LabelEncoder. I have done it by hand with a custom class:

class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()
    def fit_transform(self, x):
        features = self.le.fit_transform( x)
        return self.ohe.fit_transform( features.reshape(-1,1))
    def transform( self, x):
        return self.ohe.transform( self.la.transform( x.reshape(-1,1)))
    def inverse_tranform( self, x):
        return self.le.inverse_transform( self.ohe.inverse_tranform( x))
    def inverse_labels( self, x):
        return self.le.inverse_transform( x)

I am confident there must a way of doing it within the sklearn API using a sklearn.pipeline, but when using:

LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])

I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.

like image 707
Learning is a mess Avatar asked Feb 22 '18 13:02

Learning is a mess


People also ask

How do you do one hot encoding in SciKit learn?

# One-hot encoding a single column from sklearn. preprocessing import OneHotEncoder from seaborn import load_dataset df = load_dataset('penguins') ohe = OneHotEncoder() transformed = ohe. fit_transform(df[['island']]) print(transformed. toarray()) # Returns: # [[0.

What is the difference between OneHotEncoder and LabelEncoder?

As you can see, we have three new columns with 1s and 0s, depending on the country that the rows represent. So, that's the difference between Label Encoding and One Hot Encoding. Follow me on Twitter for more Data Science, Machine Learning, and general tech updates.

How does LabelEncoder work Sklearn?

Label Encoder: Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encode labels with a value between 0 and n_classes-1 where n is the number of distinct labels. If a label repeats it assigns the same value to as assigned earlier.


2 Answers

It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.

class ModifiedLabelEncoder(LabelEncoder):

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)

Then using the pipeline should work.

pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])

https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39

like image 120
David Stevens Avatar answered Sep 30 '22 21:09

David Stevens


From scikit-learn 0.20, OneHotEncoder accepts strings, so you don't need a LabelEncoder before it anymore. And you can just use it in a pipeline.

like image 25
bryant1410 Avatar answered Sep 30 '22 21:09

bryant1410