I have a pandas.Series – s, below – that I want to one-hot-encode. I've found through research that the 'b' level is not important for my predictive modeling task, so I can exclude it from my analysis like so:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
# [0., 0.],
# [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)
But when I go to transform a new series, one containing both 'b' and a new level, 'd', I get an error:
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform
This is to be expected since I set handle_unknown='error' above. However, I'd like to completely ignore all classes except for ['a', 'c'] in both the fitting and subsequent transforming steps. I tried this:
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "`handle_unknown` must be 'error' when the drop parameter is "
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible pattern to accomplish this task?
One-hot encoding is a process by which categorical data (such as nominal data) are converted into numerical features of a dataset. This is often a required preprocessing step since machine learning models require numerical data.
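As a quick illustration, here is a minimal sketch using pandas.get_dummies (the exact dtype of the indicator columns varies by pandas version; 0/1 integers are shown here):

import pandas as pd

# Each category becomes its own 0/1 indicator column
pd.get_dummies(pd.Series(['a', 'b', 'c', 'a']))
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0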
pandas.get_dummies cannot natively handle unknown categories at transform time; you have to bolt on extra logic to deal with them, which is inefficient. OneHotEncoder, on the other hand, handles unknown categories natively via handle_unknown='ignore'.
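For example, here is a minimal sketch of that native handling (without the drop parameter, and assuming a scikit-learn version where the sparse keyword is still accepted; it was renamed to sparse_output in 1.2):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)

# With handle_unknown='ignore', an unseen category becomes an all-zero row
enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
enc.fit(s)
enc.transform(new_s)
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [0., 0., 0.]])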
You could also approach this by subclassing OneHotEncoder so that unknown categories are ignored at transform time:
import numpy as np
from sklearn.preprocessing import OneHotEncoder


class IgnorantOneHotEncoder(OneHotEncoder):
    def transform(self, X, y=None):
        try:
            return super().transform(X)
        except ValueError as e:
            if 'Found unknown categories' in str(e):
                X = np.copy(X)
                # Keep track of indices corresponding to unknown categories
                unknown_categories_mask = ~np.isin(X, self.categories_[0]).ravel()
                # Overwrite the unknown categories in the input matrix, X,
                # with the first known category
                X[unknown_categories_mask] = self.categories_[0][0]
                # Transform X, whose categories are all known now
                X = super().transform(X)
                # Overwrite originally unknown-category records with 0 to indicate
                # the absence of any known category for that feature
                X[unknown_categories_mask, 0] = 0
                return X
            else:
                raise
Try it out:
>>> ienc = IgnorantOneHotEncoder(sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(sparse=False)
>>> ienc.transform(s)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> ienc.transform(new_s)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
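Because handle_unknown stays at its default 'error', the drop parameter from the question should, as far as I can tell, combine with this subclass as well; a sketch of the expected result (not verified on every scikit-learn version):

>>> ienc = IgnorantOneHotEncoder(drop=['b'], sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(drop=['b'], sparse=False)
>>> ienc.transform(new_s)
array([[1., 0.],
       [0., 0.],
       [0., 1.],
       [0., 0.]])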