sklearn.LabelEncoder with never seen before values

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

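    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # train and test are pandas.DataFrame's and c is whatever column
    le = LabelEncoder()
    le.fit(train[c])
    # replace every value not seen during fit with the '<unknown>' placeholder
    test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
    # register the placeholder as an extra class
    le.classes_ = np.append(le.classes_, '<unknown>')
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])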

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

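    import bisect
    import numpy as np

    le_classes = le.classes_.tolist()
    # insert '<unknown>' at the position that keeps the list sorted
    bisect.insort_left(le_classes, '<unknown>')
    # keep classes_ a NumPy array, matching what fit() produces
    le.classes_ = np.array(le_classes)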
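
To see why the position matters, here is a minimal NumPy-only sketch. It does not call scikit-learn at all; it only assumes, as described above, that the lookup is done with np.searchsorted, which expects a sorted array. The toy labels are made up for illustration.

```python
import bisect
import numpy as np

# Sorted labels, standing in for what LabelEncoder.fit would store in classes_
classes = np.array(["a", "b", "c"])

# Appending: '<' sorts before lowercase letters, so the array is no longer sorted
appended = np.append(classes, "<unknown>")
appended_code = int(np.searchsorted(appended, "<unknown>"))

# Sorted insertion keeps the array valid for binary search
as_list = classes.tolist()
bisect.insort_left(as_list, "<unknown>")
inserted = np.array(as_list)
inserted_code = int(np.searchsorted(inserted, "<unknown>"))

print(appended[appended_code])   # label that '<unknown>' would silently collide with
print(inserted[inserted_code])   # label found after sorted insertion
```

With the appended array the binary search lands on an existing label instead of the placeholder, so the failure is silent rather than an exception.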

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
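
For reference, the whole workaround can be sketched end to end roughly as follows. The helper names fit_with_unknown and transform_with_unknown, the UNKNOWN constant, and the toy data are hypothetical and only serve the illustration; plain lists are used instead of DataFrame columns to keep the sketch short.

```python
import bisect
import numpy as np
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "<unknown>"  # hypothetical placeholder label

def fit_with_unknown(train_values):
    """Fit a LabelEncoder, then add UNKNOWN to its classes in sorted order."""
    le = LabelEncoder()
    le.fit(train_values)
    classes = le.classes_.tolist()
    if UNKNOWN not in classes:
        bisect.insort_left(classes, UNKNOWN)
    le.classes_ = np.array(classes)
    return le

def transform_with_unknown(le, values):
    """Replace unseen values with UNKNOWN, then encode."""
    known = set(le.classes_)
    cleaned = [v if v in known else UNKNOWN for v in values]
    return le.transform(cleaned)

train = ["cat", "dog", "cat", "bird"]
test = ["dog", "fish", "bird"]  # "fish" was never seen during fit

le = fit_with_unknown(train)
train_codes = transform_with_unknown(le, train)
test_codes = transform_with_unknown(le, test)
decoded = le.inverse_transform(test_codes)
```

Note that inserting the placeholder shifts the codes of any class that sorts after it, so the insertion has to happen before the training data is transformed.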

asked Jan 11 '14 01:01 by cjauvin



1 Answer

LabelEncoder is basically a dictionary. You can extract and use it for future encoding:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit(X)

    # map each known label to its integer code
    le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

Retrieve the label for a single new item; if the item is missing from the dictionary, fall back to an unknown value:

    le_dict.get(new_item, '<Unknown>')

Retrieve labels for a Dataframe column:

    # <unknown_value> is a placeholder for whatever fallback code you choose
    df[your_col] = df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))
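
As a self-contained illustration of the dictionary idea, here is a minimal sketch in plain Python. It makes two assumptions that are not part of the answer: sorted(set(train)) stands in for le.classes_ (LabelEncoder stores the sorted unique labels), and -1 is used as a hypothetical integer fallback so that the encoded column stays numeric. The encode helper and the toy data are likewise made up.

```python
# Stand-in for le.classes_: the sorted unique training labels
train = ["cat", "dog", "cat", "bird"]
classes = sorted(set(train))

# Same shape as dict(zip(le.classes_, le.transform(le.classes_)))
le_dict = {label: code for code, label in enumerate(classes)}

UNKNOWN_CODE = -1  # hypothetical sentinel for labels never seen in training

def encode(values, mapping, unknown=UNKNOWN_CODE):
    """Encode each value with the mapping, falling back to `unknown`."""
    return [mapping.get(v, unknown) for v in values]

test_codes = encode(["dog", "fish", "bird"], le_dict)
print(test_codes)
```

An integer sentinel avoids mixing strings and integers in one column, which a default such as '<Unknown>' would do. On a pandas column, mapping with the dictionary via Series.map leaves NaN for unseen values, which can then be filled with the sentinel.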
answered Sep 25 '22 13:09 by Rani