I have a pandas.Series – s, below – that I want to one-hot-encode. I've found through research that the 'b' level is not important for my predictive modeling task, so I can exclude it from my analysis like so:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
# [0., 0.],
# [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)
But when I go to transform a new series, one containing both 'b' and a new level, 'd', I get an error:
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform
This is to be expected since I set handle_unknown='error' above. However, I'd like to completely ignore all classes except for ['a', 'c'] in both the fitting and subsequent transforming steps. I tried this:
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "`handle_unknown` must be 'error' when the drop parameter is "
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible pattern to accomplish this task?
One-hot encoding is a process by which categorical data (such as nominal data) are converted into numerical features of a dataset. This is often a required preprocessing step since machine learning models require numerical data.
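As a quick illustration, here is a minimal sketch using pandas.get_dummies (the exact dtype of the indicator columns varies by pandas version; 0/1 integers are shown here):

import pandas as pd

# Each category becomes its own 0/1 indicator column
pd.get_dummies(pd.Series(['a', 'b', 'c', 'a']))
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0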
pandas.get_dummies cannot natively handle unknown categories at transform time; you have to bolt on extra logic to deal with them, which is inefficient. OneHotEncoder, on the other hand, handles unknown categories natively via handle_unknown='ignore'.
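For example, here is a minimal sketch of that native handling (without the drop parameter, and assuming a scikit-learn version where the sparse keyword is still accepted; it was renamed to sparse_output in 1.2):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)

# With handle_unknown='ignore', an unseen category becomes an all-zero row
enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
enc.fit(s)
enc.transform(new_s)
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [0., 0., 0.]])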
You could also approach this by subclassing OneHotEncoder so that unknown categories are ignored at transform time:
import numpy as np
from sklearn.preprocessing import OneHotEncoder


class IgnorantOneHotEncoder(OneHotEncoder):
    def transform(self, X, y=None):
        try:
            return super().transform(X)
        except ValueError as e:
            if 'Found unknown categories' in str(e):
                X = np.copy(X)
                # Keep track of indices corresponding to unknown categories
                unknown_categories_mask = ~np.isin(X, self.categories_[0]).ravel()
                # Overwrite the unknown categories in the input matrix, X,
                # with the first known category
                X[unknown_categories_mask] = self.categories_[0][0]
                # Transform X, whose categories are all known now
                X = super().transform(X)
                # Overwrite originally unknown-category records with 0 to indicate
                # the absence of any known category for that feature
                X[unknown_categories_mask, 0] = 0
                return X
            else:
                raise
Try it out:
>>> ienc = IgnorantOneHotEncoder(sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(sparse=False)
>>> ienc.transform(s)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> ienc.transform(new_s)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
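Because handle_unknown stays at its default 'error', the drop parameter from the question should, as far as I can tell, combine with this subclass as well; a sketch of the expected result (not verified on every scikit-learn version):

>>> ienc = IgnorantOneHotEncoder(drop=['b'], sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(drop=['b'], sparse=False)
>>> ienc.transform(new_s)
array([[1., 0.],
       [0., 0.],
       [0., 1.],
       [0., 0.]])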