Why shouldn't the sklearn LabelEncoder be used to encode input data?

The docs for sklearn.preprocessing.LabelEncoder start with

This transformer should be used to encode target values, i.e. y, and not the input X.

Why is this?

Here is just one example of this recommendation being ignored in practice, although there seem to be many more. https://www.kaggle.com/matleonard/feature-generation contains

#(ks is the input data)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)
hlud6646 asked Jan 25 '20 23:01



2 Answers

Maybe because:

  1. It doesn't naturally work on multiple columns at once.
  2. It doesn't support ordering. I.e. if your categories are ordinal, such as:

Awful, Bad, Average, Good, Excellent

LabelEncoder would assign the codes in sorted (alphabetical) order of the labels rather than in their logical order, which will not help your classifier.
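As a quick check of the order LabelEncoder assigns to the quality scale above, here is a minimal sketch (the list name `quality` is just for illustration). It shows that the codes follow the sorted order of the strings, so 'Average' ends up below 'Awful' and 'Excellent' below 'Good':

```python
from sklearn.preprocessing import LabelEncoder

# The ordinal scale from the answer, in an arbitrary row order.
quality = ['Bad', 'Awful', 'Good', 'Average', 'Excellent']

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

# classes_ holds the labels in the order used for encoding (sorted).
print(list(encoder.classes_))  # ['Average', 'Awful', 'Bad', 'Excellent', 'Good']
print(codes.tolist())          # [2, 1, 4, 0, 3]
```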

In this case you could use either an OrdinalEncoder or a manual replacement.

1. OrdinalEncoder:

Encode categorical features as an integer array.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]], columns=['Quality', 'Label'])
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])  # Use the 'categories' parameter to specify the desired order. Otherwise the categories are taken from the data in sorted order.
enc.fit_transform(df[['Quality']])  # Can fit on one feature or on multiple features at once.

Output:

array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])

Notice the logical order in the output.
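The same encoder can handle several columns in one call, which LabelEncoder does not do directly. The sketch below assumes a second, hypothetical 'Size' column that is not part of the original example; each inner list in `categories` gives the order for the column in the same position:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# 'Size' is a made-up second ordinal feature, added only for illustration.
df2 = pd.DataFrame({
    'Quality': ['Bad', 'Awful', 'Good'],
    'Size': ['M', 'S', 'L'],
})

# One ordered category list per column, in column order.
enc2 = OrdinalEncoder(categories=[
    ['Awful', 'Bad', 'Average', 'Good', 'Excellent'],
    ['S', 'M', 'L'],
])
encoded2 = enc2.fit_transform(df2[['Quality', 'Size']])
print(encoded2)
```

Each row of the result holds the rank of 'Quality' followed by the rank of 'Size'.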

2. Manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].map(scale_mapper)  # map each label to its rank; returns an int64 Series

Output:

0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64
Alaa M. answered Nov 12 '22 17:11


It is not a big deal if the encoding changes the output values y, because the model simply learns against whatever values it is given (for example, a regression driven by the error).

The problem arises when it changes the input values X: the arbitrary integers distort the weights the model assigns to those inputs, which can make correct predictions impossible.

You can still use it on X when there are only a few options. For example, 2 categories, 2 currencies, and 2 cities encoded as ints do not change the outcome much.
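To illustrate the two-option case, here is a small sketch (the currency values are made up). With only two distinct values, the LabelEncoder codes coincide with a single one-hot indicator column, so no artificial ordering among categories is introduced:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A hypothetical feature with exactly two distinct values.
currency = np.array(['USD', 'EUR', 'USD', 'EUR'])

# Integer codes: 'EUR' -> 0, 'USD' -> 1 (sorted order).
label_codes = LabelEncoder().fit_transform(currency)

# One-hot encoding with the first category dropped leaves one 0/1 column.
one_hot = OneHotEncoder(drop='first').fit_transform(currency.reshape(-1, 1)).toarray()

print(label_codes.tolist())    # [1, 0, 1, 0]
print(one_hot[:, 0].tolist())  # [1.0, 0.0, 1.0, 0.0]
```

With three or more values this equivalence no longer holds, which is where the ordering problem described above starts to matter.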

sogu answered Nov 12 '22 19:11