If a sklearn.preprocessing.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:
    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # train and test are pandas DataFrames and c is the column to encode
    le = LabelEncoder()
    le.fit(train[c])
    # Map values not seen during fit to the placeholder
    test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
    le.classes_ = np.append(le.classes_, '<unknown>')
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])
This works, but is there a better solution?
Update
As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:
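    import bisect
    import numpy as np

    # Insert '<unknown>' at its sorted position instead of appending it
    le_classes = le.classes_.tolist()
    bisect.insort_left(le_classes, '<unknown>')
    # Convert back to an array so inverse_transform can still index classes_
    le.classes_ = np.array(le_classes)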
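To see why the order matters, here is a small NumPy-only sketch. It assumes, as the update above guesses, that the lookup is done with np.searchsorted; the class names are made up for illustration. A binary search only gives meaningful positions on a sorted array, so with the placeholder appended at the end the returned index points at a different class.

```python
import numpy as np

# Hypothetical class arrays, for illustration only
appended = np.array(["a", "b", "c", "<unknown>"])  # placeholder appended at the end
sorted_classes = np.sort(appended)                  # placeholder at its sorted position

# np.searchsorted performs a binary search and assumes sorted input
idx_appended = int(np.searchsorted(appended, "<unknown>"))
idx_sorted = int(np.searchsorted(sorted_classes, "<unknown>"))

# On the unsorted array the index does not point at the placeholder
print(appended[idx_appended], sorted_classes[idx_sorted])
```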
However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
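One possible way to avoid touching classes_ by hand is to include the placeholder when fitting, so that fit() puts it in sorted order itself. This is a different technique from the one in the question, and only a sketch: the helper names fit_encoder_with_unknown and transform_with_unknown and the sample city values are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "<unknown>"  # same placeholder string as in the question


def fit_encoder_with_unknown(train_values):
    """Fit a LabelEncoder on the training values plus the placeholder."""
    le = LabelEncoder()
    # fit() sorts and de-duplicates the classes, so no manual insertion is needed
    le.fit(list(train_values) + [UNKNOWN])
    return le


def transform_with_unknown(le, values):
    """Replace labels the encoder has not seen with the placeholder, then encode."""
    known = set(le.classes_)
    cleaned = [v if v in known else UNKNOWN for v in values]
    return le.transform(cleaned)


# Hypothetical data: "berlin" was not present at fit time
le = fit_encoder_with_unknown(["paris", "tokyo", "amsterdam"])
encoded = transform_with_unknown(le, ["tokyo", "berlin"])
```

With a DataFrame, the same helpers could be applied per column, e.g. passing train[c] to the first and test[c] to the second.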
LabelEncoder can be used to normalize labels. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
Label Encoder: sklearn provides an efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it is assigned the same value as before.
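A minimal sketch of that behaviour, using made-up colour labels: the fitted classes are sorted, repeated labels share a code, and a label that was not present at fit time makes transform raise a ValueError.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["red", "green", "blue", "green"])  # "green" repeats; it gets one class

# Codes are positions in the sorted le.classes_ array
codes = le.transform(["green", "blue", "red"])
labels = le.inverse_transform(codes)

# A label that was never fitted is rejected
try:
    le.transform(["purple"])
    unseen_error = None
except ValueError as exc:
    unseen_error = exc
```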
LabelEncoder is basically a dictionary. You can extract and use it for future encoding:
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit(X)
    # Map each class label to its integer code
    le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
Retrieve the label for a single new item; if the item is missing, fall back to an unknown value:

    le_dict.get(new_item, '<Unknown>')

Retrieve labels for a DataFrame column:

    # <unknown_value> is a placeholder for whatever fallback you choose
    df[your_col] = df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))
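Put together, the dictionary approach might look like the sketch below. The DataFrames, the column name city, and the choice of -1 as the fallback code are all assumptions made for illustration; any value that cannot collide with a real code would do.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "berlin" appears only in the test set
train = pd.DataFrame({"city": ["paris", "tokyo", "amsterdam"]})
test = pd.DataFrame({"city": ["tokyo", "berlin"]})

le = LabelEncoder()
le.fit(train["city"])

# classes_ is sorted and unique, so each label's position is its code
le_dict = {label: code for code, label in enumerate(le.classes_)}

UNKNOWN_CODE = -1  # assumed sentinel for labels not seen during fit
test["city_code"] = test["city"].map(lambda x: le_dict.get(x, UNKNOWN_CODE))
```

A plain dictionary lookup never raises on unseen labels, which is the main reason to prefer it here over calling le.transform directly.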