Is it possible to specify handle_unknown = 'ignore' for certain columns and 'error' for others inside OneHotEncoder?

I have a dataframe in which every column is categorical, and I am encoding it with OneHotEncoder from sklearn.preprocessing. My code is as follows:

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression  # needed for the 'LReg' step
from sklearn.pipeline import Pipeline


steps = [('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')), ('LReg', LinearRegression())]

pipeline = Pipeline(steps)

As shown above, the handle_unknown parameter of OneHotEncoder takes either 'error' or 'ignore', and the setting applies to every column the encoder sees. Is there a way to ignore unknown categories for some columns while raising an error for the others? Here is a complete example:

import pandas as pd

df = pd.DataFrame({'Country':['USA','USA','IND','UK','UK','UK'],
                   'Fruits':['Apple','Strawberry','Mango','Berries','Banana','Grape'],
                   'Flower':   ['Rose','Lily','Orchid','Petunia','Lotus','Dandelion'],
                   'Result':[1,2,3,4,5,6,]})

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown ='ignore')) ,('LReg', LinearRegression())]

pipeline = Pipeline(steps)

from sklearn.model_selection import train_test_split

X = df[["Country","Flower","Fruits"]]
Y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=30, shuffle =True)

print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Mean Squared Error:
MSE = mean_squared_error(y_test,y_pred)

print("MSE", MSE)

#Root Mean Squared Error:
from math import sqrt

RMSE = sqrt(MSE)
print("RMSE", RMSE)

#R-squared score:
R2_score = r2_score(y_test,y_pred)

print("R2_score", R2_score)

In this case, if a new value appears in any of the columns (Country, Fruits or Flower), the model is still able to predict an output.

Is there a way to ignore unknown categories for Fruits and Flower, but raise an error for an unknown value in the Country column?

asked Jun 14 '19 20:06

sayo

1 Answer

I think ColumnTransformer() would solve the problem. It lets you apply one OneHotEncoder with handle_unknown='ignore' to one list of columns and a second OneHotEncoder with handle_unknown='error' to another list.

Convert your pipeline to the following using ColumnTransformer:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("ohe_ignore", OneHotEncoder(handle_unknown ='ignore'), 
                              ["Flower", "Fruits"]),
                        ("ohe_raise_error",  OneHotEncoder(handle_unknown ='error'),
                               ["Country"])])

steps = [('OneHotEncoder', ct),
         ('LReg', LinearRegression())]

pipeline = Pipeline(steps)

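Now, after fitting the pipeline on the training data with pipeline.fit(X_train, y_train), an unknown value in Fruits is ignored, while an unknown value in Country raises an error: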

>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))

array([2.83333333])

>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['chk'], 'Flower': ['Rose']}))

array([3.66666667])


>>> pipeline.predict(pd.DataFrame({'Country': ['chk'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))

ValueError: Found unknown categories ['chk'] in column 0 during transform

Note: ColumnTransformer is available from scikit-learn version 0.20 onwards.

answered Oct 11 '22 11:10

Venkatachalam

Tags:

python

pandas

one-hot-encoding

scikit-learn
