sklearn.LabelEncoder with never seen before values

Tags:

python

scikit-learn

If a sklearn.preprocessing.LabelEncoder has been fitted on a training set, its transform method raises a ValueError when it encounters values in the test set that it did not see during fitting.
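To make the failure concrete, here is a minimal sketch with made-up labels (the animal names are purely illustrative, not taken from the data in question). On the scikit-learn versions I'm aware of, transform raises a ValueError for the unseen value:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["cat", "dog", "fish"])  # hypothetical training labels

try:
    # "bird" was never seen during fit
    le.transform(["cat", "bird"])
    failed = False
except ValueError as exc:
    failed = True
    print("transform failed:", exc)
```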

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

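import numpy as np
from sklearn.preprocessing import LabelEncoder
# train and test are pandas DataFrames and c is whatever column
le = LabelEncoder()
le.fit(train[c])
# replace every value not seen during fit with a placeholder
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# register the placeholder as an extra class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])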

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

import bisect
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, '<unknown>')
le.classes_ = le_classes

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
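For what it's worth, the reason sorted order matters can be sketched with NumPy alone. This is only an illustration with made-up class names, under the assumption stated above that transform relies on np.searchsorted. Binary search presumes sorted input, so looking up '<unknown>' in the appended (unsorted) array lands on the wrong slot, while the sorted array gives the right one:

```python
import numpy as np

# hypothetical classes as a fitted encoder would store them (sorted)
classes = np.array(["a", "b", "c"])

# appending puts '<unknown>' last, although '<' sorts before letters
appended = np.append(classes, "<unknown>")
# re-sorting restores the order that binary search expects
resorted = np.sort(appended)

idx_appended = np.searchsorted(appended, "<unknown>")
idx_resorted = np.searchsorted(resorted, "<unknown>")

# only the sorted array maps '<unknown>' back to itself
ok_appended = appended[idx_appended] == "<unknown>"
ok_resorted = resorted[idx_resorted] == "<unknown>"
print(ok_appended, ok_resorted)
```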

426

asked Jan 11 '14 01:01

cjauvin

1 Answer

LabelEncoder is basically a dictionary. You can extract and use it for future encoding:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(X)
# map each known class to its integer code
le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

Retrieve the label for a single new item; if the item is missing from the dictionary, fall back to an unknown value:

le_dict.get(new_item, '<Unknown>')

Retrieve labels for a Dataframe column:

df[your_col] = df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))
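Putting the pieces together, a minimal sketch might look like the following. The column name, the sample values and the UNKNOWN_CODE sentinel of -1 are assumptions made for illustration; any integer that cannot collide with a real class code would do.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN_CODE = -1  # hypothetical sentinel for labels not seen during fit

# made-up data: "purple" appears only in the test set
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "purple", "red"]})

le = LabelEncoder()
le.fit(train["color"])

# extract the fitted mapping: class label -> integer code
le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

# encode both frames through the dictionary; unseen labels get the sentinel
train["color_code"] = train["color"].map(lambda x: le_dict.get(x, UNKNOWN_CODE))
test["color_code"] = test["color"].map(lambda x: le_dict.get(x, UNKNOWN_CODE))

print(test)
```

Using an integer sentinel rather than the string '<Unknown>' keeps the encoded column numeric, which most estimators expect.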

135

answered Sep 25 '22 13:09

Rani
