The docs for sklearn.LabelEncoder start with:
This transformer should be used to encode target values, i.e. y, and not the input X.
Why is this?
Here is just one example of this recommendation being ignored in practice, although there seem to be many more. https://www.kaggle.com/matleonard/feature-generation contains
# (ks is the input data)
# Label encoding
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)
Label Encoder: Label encoding in Python can be achieved using the sklearn library. Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values: LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels.
Use LabelEncoder for the label column in supervised learning when it is a binary classification problem. Don't use LabelEncoder when the categorical features have more than two values: nominal categorical features with more than two values may get treated as ordinal by the machine learning model.
LabelEncoder encodes labels by assigning them numbers. Thus, if the feature is color with values such as [‘white’, ‘red’, ‘black’, ‘blue’], using LabelEncoder may encode the color strings as [0, 1, 2, 3].
This is why label encoding is not often used for encoding categorical features in machine learning. One-hot encoding overcomes this shortcoming of label encoding and is commonly used with machine learning algorithms, although it also has some disadvantages (for example, the extra columns it creates for high-cardinality features).
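To make the contrast concrete, here is a minimal sketch (using a hypothetical color column; the names are illustrative) of the arbitrary integers LabelEncoder assigns versus the order-free columns OneHotEncoder produces:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical nominal feature with no inherent order
colors = pd.DataFrame({'color': ['white', 'red', 'black', 'blue']})

# LabelEncoder assigns integers in sorted (alphabetical) order:
# black -> 0, blue -> 1, red -> 2, white -> 3
print(LabelEncoder().fit_transform(colors['color']))  # [3 2 0 1]

# OneHotEncoder creates one binary column per category, implying no order
print(OneHotEncoder().fit_transform(colors[['color']]).toarray())

The one-hot matrix lets a model treat each color independently, whereas the label-encoded column suggests, e.g., that 'white' (3) is somehow "greater" than 'red' (2).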
Maybe because, given ordinal categories such as

Awful, Bad, Average, Good, Excellent

LabelEncoder would give them an arbitrary order (it sorts the distinct labels alphabetically, regardless of their meaning), which will not help your classifier. In this case you could use either an OrdinalEncoder or a manual replacement.
From the OrdinalEncoder docs: "Encode categorical features as an integer array."
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]],
                  columns=['Quality', 'Label'])

# Use the 'categories' parameter to specify the desired order.
# Otherwise the order is inferred from the sorted unique values in the data.
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])
enc.fit_transform(df[['Quality']])  # can fit on one feature or on several at once
Output:
array([[1.],
[0.],
[3.],
[2.],
[4.]])
Notice the logical order in the output.
# Manual replacement: map each category to its rank explicitly
scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)
Output:
0 1
1 0
2 3
3 2
4 4
Name: Quality, dtype: int64
It is not that big a deal that it changes the output values y, because the model simply relearns based on them (in a regression, based on the error).
The problem is when it changes the weighting of the input values X: the spurious order it imposes makes it impossible to do correct predictions.
You can do it on X when a feature has few options: for example, 2 categories, 2 currencies, or 2 cities encoded into ints does not change the game too much, since a binary feature encoded as 0/1 carries no misleading order.
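To illustrate the intended use, here is a minimal sketch (with a hypothetical binary target; the names are illustrative) of LabelEncoder applied to y rather than X:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = pd.Series(['spam', 'ham', 'ham', 'spam'])  # hypothetical binary target

le = LabelEncoder()
y_encoded = le.fit_transform(y)         # array([1, 0, 0, 1]); 'ham' -> 0, 'spam' -> 1
print(le.inverse_transform(y_encoded))  # the original labels can always be recovered

Since the model relearns the relationship between the encoded target and the inputs, the particular integers assigned to y do not matter.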