How to handle One-Hot Encoding in production environment when number of features in Training and Test are different?

Tags: python, machine-learning, one-hot-encoding, feature-selection

In experiments, we usually train on 70% of the data and test on the remaining 30%. But what happens once the model is in production? The following may occur:

Training Set:

-----------------------
| Ser | Type of Car   |
-----------------------
|  1  | Hatchback     |
|  2  | Sedan         |
|  3  | Coupe         |
|  4  | SUV           |
-----------------------

After one-hot encoding, this is what we get:

-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
|  1  |     1     |   0   |   0   |  0  |
|  2  |     0     |   1   |   0   |  0  |
|  3  |     0     |   0   |   1   |  0  |
|  4  |     0     |   0   |   0   |  1  |
-----------------------------------------

My model is trained, and now I want to deploy it across multiple dealerships. The model was trained on 4 features. Now suppose a certain dealership sells only Sedans and Coupes:

Test Set:

-----------------------
| Ser | Type of Car   |
-----------------------
|  1  | Coupe         |
|  2  | Sedan         |
-----------------------

One-hot encoding results in:

---------------------------
| Ser | Coupe     | Sedan |
---------------------------
|  1  |     1     |   0   |
|  2  |     0     |   1   |
---------------------------

Here the test set has only 2 features, while the model expects 4. It does not make sense to build a separate model for every new dealership. How should such cases be handled in production? Is there another encoding method that can handle categorical variables?

asked Jul 24 '18 18:07

Roshan Joe Vincent

1 Answer

I'll assume you are using pandas to do the one hot encoding. If not, you have to do some more work, but the logic is still the same.

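import pandas as pd

# Categories seen in the training set; fixing them keeps the encoded columns stable
known_categories = ['Sedan', 'Coupe', 'Limo']

# Production data containing a category not seen in training ('Ferrari')
car_type = pd.Series(['Sedan', 'Ferrari'])

# Values outside known_categories become NaN
car_type = pd.Categorical(car_type, categories=known_categories)

# dtype=float so the output is shown as 1.0 / 0.0
pd.get_dummies(car_type, dtype=float)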

The result is:

    Sedan   Coupe   Limo
0   1.0      0.0    0.0    # Sedan entry
1   0.0      0.0    0.0    # Ferrari entry

Since Ferrari is not in the list of known categories, all the one-hot encoded entries for the Ferrari row are zero. The same holds for any new car type that shows up in your production data: every one-hot column for that row will be 0, and the set of columns stays the same as in training.
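If you are not using pandas, the same logic can be written by hand. The following is only a minimal sketch, assuming the category list from the question's training set has been saved alongside the model; `TRAINING_CATEGORIES` and `one_hot` are hypothetical names chosen for illustration, not part of any library. It shows both cases: a dealership with fewer car types than training, and a car type never seen in training.

```python
# Hypothetical helper (not a pandas or library function): one-hot encode
# against a fixed category list that was saved at training time.
TRAINING_CATEGORIES = ["Hatchback", "Sedan", "Coupe", "SUV"]


def one_hot(values, categories=TRAINING_CATEGORIES):
    """Return one row per value, with one 0/1 column per training category."""
    rows = []
    for value in values:
        # A value outside `categories` matches nothing, so its row is all zeros.
        rows.append([1 if value == category else 0 for category in categories])
    return rows


# Dealership that sells only Coupes and Sedans: the rows still have 4 columns.
dealership_rows = one_hot(["Coupe", "Sedan"])

# Car type never seen in training: an all-zero row of the same width.
unseen_rows = one_hot(["Ferrari"])

for row in dealership_rows + unseen_rows:
    print(row)
```

Because the column layout comes from the saved training categories rather than from whatever values appear in the incoming data, the model always receives the same number of features in the same order.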

answered Sep 16 '22 21:09

Demetri Pananos
