While preprocessing the labels for a machine learning classifying task, I need to one hot encode the labels which take string values. It happens that OneHotEncoder
from sklearn.preprocessing
or to_categorical
from kera.np_utils
require int
inputs. This means that I need to precede the one hot encoder with a LabelEncoder
. I have done it by hand with a custom class:
class LabelOneHotEncoder():
def __init__(self):
self.ohe = OneHotEncoder()
self.le = LabelEncoder()
def fit_transform(self, x):
features = self.le.fit_transform( x)
return self.ohe.fit_transform( features.reshape(-1,1))
def transform( self, x):
return self.ohe.transform( self.la.transform( x.reshape(-1,1)))
def inverse_tranform( self, x):
return self.le.inverse_transform( self.ohe.inverse_tranform( x))
def inverse_labels( self, x):
return self.le.inverse_transform( x)
I am confident there must a way of doing it within the sklearn API using a sklearn.pipeline
, but when using:
LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])
I get the error ValueError: bad input shape ()
from the OneHotEncoder
. My guess is that the output of the LabelEncoder
needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.
# One-hot encoding a single column from sklearn. preprocessing import OneHotEncoder from seaborn import load_dataset df = load_dataset('penguins') ohe = OneHotEncoder() transformed = ohe. fit_transform(df[['island']]) print(transformed. toarray()) # Returns: # [[0.
As you can see, we have three new columns with 1s and 0s, depending on the country that the rows represent. So, that's the difference between Label Encoding and One Hot Encoding. Follow me on Twitter for more Data Science, Machine Learning, and general tech updates.
Label Encoder: Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encode labels with a value between 0 and n_classes-1 where n is the number of distinct labels. If a label repeats it assigns the same value to as assigned earlier.
It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.
class ModifiedLabelEncoder(LabelEncoder):
def fit_transform(self, y, *args, **kwargs):
return super().fit_transform(y).reshape(-1, 1)
def transform(self, y, *args, **kwargs):
return super().transform(y).reshape(-1, 1)
Then using the pipeline should work.
pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39
From scikit-learn 0.20, OneHotEncoder
accepts strings, so you don't need a LabelEncoder
before it anymore. And you can just use it in a pipeline.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With