
sklearn.preprocessing.OneHotEncoder: using drop and handle_unknown='ignore'

I have a pandas.Series, s below, that I want to one-hot encode. I've found through research that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like so:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)

enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
#        [0., 0.],
#        [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)

But when I go to transform a new series, one containing both 'b' and a new level, 'd', I get an error:

new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform

This is to be expected since I set handle_unknown='error' above. However, I'd like to completely ignore all classes except for ['a', 'c'] in both the fitting and subsequent transforming steps. I tried this:

enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "handle_unknown must be 'error' when the drop parameter is "
ValueError: handle_unknown must be 'error' when the drop parameter is specified, as both would create categories that are all zero.

It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible pattern to accomplish this task?

asked Jan 31 '20 17:01 by blacksite


1 Answer

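You could approach this by subclassing OneHotEncoder and overriding transform so that rows with unknown categories are encoded as all zeros instead of raising an error. Note that this version assumes a single input column and dense output (sparse=False):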

import numpy as np
from sklearn.preprocessing import OneHotEncoder


class IgnorantOneHotEncoder(OneHotEncoder):
    # Assumes a single input column and dense output (sparse=False)
    def transform(self, X, y=None):
        try:
            return super().transform(X)
        except ValueError as e:
            if 'Found unknown categories' in str(e):
                # Work on a copy so the caller's array is not modified
                X = np.copy(X)
                # Keep track of indices corresponding to unknown categories
                unknown_categories_mask = ~np.isin(X, self.categories_[0]).ravel()
                # Overwrite the unknown categories in the input matrix, X, with the first known category
                X[unknown_categories_mask] = self.categories_[0][0]
                # Transform X, whose categories are all known now
                X = super().transform(X)
                # Overwrite originally unknown-category records with 0 to indicate
                # absence of any value for any category for that feature
                X[unknown_categories_mask, 0] = 0
                return X
            else:
                # Any other ValueError is unrelated to unknown categories
                raise

Try it out:

>>> ienc = IgnorantOneHotEncoder(sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(sparse=False)
>>> ienc.transform(s)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> ienc.transform(new_s)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
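
A different route that avoids subclassing is to leave out drop and instead pass the levels you want to keep through the categories parameter, together with handle_unknown='ignore'. This is a swapped-in technique rather than the subclass above, and only a minimal sketch: it assumes a single column, uses plain NumPy object arrays in place of the pandas.Series, and the names encoded_s and encoded_new_s are purely illustrative. The default sparse output is converted with .toarray() so the sketch does not depend on the sparse keyword.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, as (n_samples, 1) object arrays
s = np.array(['a', 'b', 'c'], dtype=object).reshape(-1, 1)
new_s = np.array(['a', 'b', 'c', 'd'], dtype=object).reshape(-1, 1)

# Only 'a' and 'c' get output columns; with handle_unknown='ignore',
# any other level seen during fit or transform becomes an all-zero row
enc = OneHotEncoder(categories=[['a', 'c']], handle_unknown='ignore')

# Output is sparse by default; convert to dense arrays for inspection
encoded_s = enc.fit_transform(s).toarray()
encoded_new_s = enc.transform(new_s).toarray()

print(encoded_s)
print(encoded_new_s)
```

Here both 'b' (seen during fit) and 'd' (seen only at transform time) should come out as rows of zeros, and the output has just two columns, one for 'a' and one for 'c'.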
answered Oct 17 '22 13:10 by blacksite