scikit-learn: How to compose LabelEncoder and OneHotEncoder with a pipeline?

Tags: python, one-hot-encoding, scikit-learn

While preprocessing the labels for a machine learning classification task, I need to one-hot encode labels that take string values. It turns out that OneHotEncoder from sklearn.preprocessing and to_categorical from keras.utils.np_utils require int inputs, so I need to precede the one-hot encoder with a LabelEncoder. I have done it by hand with a custom class:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()
    def fit_transform(self, x):
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))
    def transform(self, x):
        # Encode the labels as integers first, then reshape them into a column
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))
    def inverse_transform(self, x):
        # OneHotEncoder returns a column of integer codes; flatten it for LabelEncoder
        return self.le.inverse_transform(self.ohe.inverse_transform(x).ravel())
    def inverse_labels(self, x):
        return self.le.inverse_transform(x)

I am confident there must be a way of doing it within the sklearn API using a sklearn.pipeline, but when using:

LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])

I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.

707

asked Feb 22 '18 13:02

Learning is a mess

2 Answers

It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.

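from sklearn.preprocessing import LabelEncoder

class ModifiedLabelEncoder(LabelEncoder):

    # Extra positional/keyword arguments absorb the y that Pipeline passes along
    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)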

Then using the pipeline should work.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
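
To check what this pipeline actually returns, here is a minimal self-contained sketch. The class is repeated so the snippet runs on its own; the names dense, codes and recovered are only illustrative, and the round trip assumes scikit-learn 0.20 or later, where OneHotEncoder has inverse_transform.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class ModifiedLabelEncoder(LabelEncoder):
    """LabelEncoder whose output is a column, as OneHotEncoder expects."""

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)


pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])

# The default OneHotEncoder output is sparse; densify it only for inspection.
dense = pipe.fit_transform(["dog", "cat", "dog"]).toarray()
print(dense)

# Undo both steps by hand: one-hot -> integer codes -> original strings.
codes = pipe.named_steps["ohe"].inverse_transform(dense).ravel()
recovered = pipe.named_steps["le"].inverse_transform(codes)
print(recovered)
```

The columns follow the sorted label order ('cat', then 'dog'), so each 'dog' row has its 1 in the second column.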

https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39

120

answered Sep 30 '22 21:09

David Stevens

Since scikit-learn 0.20, OneHotEncoder accepts string inputs directly, so you no longer need a LabelEncoder in front of it. You can use it on its own in a pipeline.
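
As a rough sketch of what that looks like (assuming scikit-learn 0.20 or later; the sample labels and variable names are made up for illustration), note that OneHotEncoder still expects 2-D input, so the labels are reshaped into a single column before being passed in:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative string labels, reshaped to one column (n_samples, 1).
labels = np.array(["dog", "cat", "dog"]).reshape(-1, 1)

pipe = Pipeline([("ohe", OneHotEncoder())])

# fit_transform returns a sparse matrix by default; densify for inspection.
encoded = pipe.fit_transform(labels).toarray()
print(encoded)

# The learned categories give the meaning of each output column.
categories = pipe.named_steps["ohe"].categories_[0]
print(categories)

# inverse_transform maps one-hot rows back to the original strings.
decoded = pipe.named_steps["ohe"].inverse_transform(encoded).ravel()
print(decoded)
```

No custom subclass is needed here, and the string categories are stored on the encoder itself, so decoding is a single call.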

answered Sep 30 '22 21:09

bryant1410
