
Is it possible to specify handle_unknown = 'ignore' for certain columns and 'error' for others inside OneHotEncoder?

I have a dataframe in which all columns are categorical, and I am encoding it with OneHotEncoder from sklearn.preprocessing. My code is below:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline


steps = [('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')), ('LReg', LinearRegression())]

pipeline = Pipeline(steps)

As the code shows, the handle_unknown parameter of OneHotEncoder takes either 'error' or 'ignore'. Is there a way to ignore unknown categories for certain columns while raising an error for the others? Here is a complete example:

import pandas as pd

df = pd.DataFrame({'Country':['USA','USA','IND','UK','UK','UK'],
                   'Fruits':['Apple','Strawberry','Mango','Berries','Banana','Grape'],
                   'Flower':   ['Rose','Lily','Orchid','Petunia','Lotus','Dandelion'],
                   'Result':[1,2,3,4,5,6,]})

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown ='ignore')) ,('LReg', LinearRegression())]

pipeline = Pipeline(steps)

from sklearn.model_selection import train_test_split

X = df[["Country","Flower","Fruits"]]
Y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=30, shuffle =True)

print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Mean Squared Error:
MSE = mean_squared_error(y_test,y_pred)

print("MSE", MSE)

#Root Mean Squared Error:
from math import sqrt

RMSE = sqrt(MSE)
print("RMSE", RMSE)

#R-squared score:
R2_score = r2_score(y_test,y_pred)

print("R2_score", R2_score)

In this case, if a new value appears in any of the columns (Country, Fruits or Flower), the model is still able to predict an output.

Is there a way to ignore unknown categories in Fruits and Flower but raise an error for an unknown value in the Country column?

asked Jun 14 '19 at 20:06 by sayo


1 Answer

I think ColumnTransformer would solve the problem. You can list the columns that should use OneHotEncoder with handle_unknown='ignore', and separately the columns that should use handle_unknown='error'.

Convert your pipeline to the following using ColumnTransformer:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("ohe_ignore", OneHotEncoder(handle_unknown ='ignore'), 
                              ["Flower", "Fruits"]),
                        ("ohe_raise_error",  OneHotEncoder(handle_unknown ='error'),
                               ["Country"])])

steps = [('OneHotEncoder', ct),
         ('LReg', LinearRegression())]

pipeline = Pipeline(steps)
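
To see what each encoder does to its own columns, the ColumnTransformer can be fitted and inspected on its own. The following is only a minimal sketch: the three-row `train` frame and the `new_row` frame are made-up illustration data (not the question's dataset), and the `toarray` check is there because the output may be sparse or dense depending on the scikit-learn defaults in use.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data, used only for illustration.
train = pd.DataFrame({
    "Country": ["USA", "IND", "UK"],
    "Fruits": ["Apple", "Mango", "Banana"],
    "Flower": ["Rose", "Lily", "Orchid"],
})

ct = ColumnTransformer([
    ("ohe_ignore", OneHotEncoder(handle_unknown="ignore"), ["Flower", "Fruits"]),
    ("ohe_raise_error", OneHotEncoder(handle_unknown="error"), ["Country"]),
])
ct.fit(train)

# Categories learned by each encoder, one array per input column.
ignore_categories = ct.named_transformers_["ohe_ignore"].categories_
error_categories = ct.named_transformers_["ohe_raise_error"].categories_

# 'chk' was never seen in Fruits; Country and Flower hold known values.
new_row = pd.DataFrame({"Country": ["UK"], "Fruits": ["chk"], "Flower": ["Rose"]})
encoded = ct.transform(new_row)
# Convert to a dense array if the transformer returned a sparse matrix.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded

# Output columns follow the transformer order: Flower, Fruits, then Country.
print(encoded)
```

In this sketch the three Fruits columns come out as all zeros for the unknown value, while the Flower and Country blocks each contain a single 1.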

Now, when we predict:

>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))

array([2.83333333])

>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['chk'], 'Flower': ['Rose']}))

array([3.66666667])


>>> pipeline.predict(pd.DataFrame({'Country': ['chk'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))

ValueError: Found unknown categories ['chk'] in column 0 during transform

Note: ColumnTransformer is available from scikit-learn version 0.20.
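
If the calling code should keep running when an unknown Country arrives, the ValueError can be caught around predict. This is a rough sketch under a few assumptions: `safe_predict` and `FEATURES` are hypothetical names introduced here, and the pipeline is fitted on the full dataframe from the question rather than on a train/test split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Country": ["USA", "USA", "IND", "UK", "UK", "UK"],
    "Fruits": ["Apple", "Strawberry", "Mango", "Berries", "Banana", "Grape"],
    "Flower": ["Rose", "Lily", "Orchid", "Petunia", "Lotus", "Dandelion"],
    "Result": [1, 2, 3, 4, 5, 6],
})
FEATURES = ["Country", "Flower", "Fruits"]  # hypothetical constant

ct = ColumnTransformer([
    ("ohe_ignore", OneHotEncoder(handle_unknown="ignore"), ["Flower", "Fruits"]),
    ("ohe_raise_error", OneHotEncoder(handle_unknown="error"), ["Country"]),
])
pipeline = Pipeline([("OneHotEncoder", ct), ("LReg", LinearRegression())])
# Assumption: fit on all rows instead of using train_test_split.
pipeline.fit(df[FEATURES], df["Result"])


def safe_predict(row):
    """Hypothetical helper: return (prediction, None) or (None, error text)."""
    frame = pd.DataFrame([row], columns=FEATURES)
    try:
        return float(pipeline.predict(frame)[0]), None
    except ValueError as exc:
        # Raised by the encoder that was built with handle_unknown='error'.
        return None, str(exc)


# Unknown fruit: ignored by the encoder, so a prediction comes back.
print(safe_predict({"Country": "UK", "Fruits": "chk", "Flower": "Rose"}))
# Unknown country: the encoder raises and the error text is returned.
print(safe_predict({"Country": "chk", "Fruits": "Apple", "Flower": "Rose"}))
```

Returning the error text instead of re-raising is just one design choice; logging the row or rejecting the whole batch would work equally well with the same try/except.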

answered Oct 11 '22 at 11:10 by Venkatachalam