How do I resolve one hot encoding if my test data has missing values in a col?

For example, if my training data has the categorical values (1, 2, 3, 4, 5) in a column, one-hot encoding gives me 5 columns. But my test data contains only 4 of the 5 values, say (1, 3, 4, 5), so one-hot encoding gives me only 4 columns. If I then apply my trained weights to the test data, I get an error because the column dimensions of the train and test data do not match: dim(4) != dim(5). Any suggestions on what to do about the missing column values? An image of my code is provided below:

[image: screenshot of the code]

asked Nov 23 '17 17:11

Nikhil Mishra

1 Answer

Please don't make this mistake!

Yes, you can use the hack of concatenating the train and test sets before encoding and fool yourself, but the real problem appears in production. There, your model will someday face an unknown level of your categorical variable and break.
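To make the failure mode concrete, here is a minimal pandas sketch. The frame names and the `prod` data are made up for illustration, and pinning the columns with `reindex` is only one possible workaround, not something recommended above. It shows that the column set can be fixed to what was seen in training, but a level that was never seen (6 here) then silently becomes a row of zeros.

```python
import pandas as pd

# Hypothetical data: the test set lacks level 2, and "production" data
# contains level 6, which never appeared in training.
train = pd.DataFrame({"col": [1, 2, 3, 4, 5]})
test = pd.DataFrame({"col": [1, 3, 4, 5]})
prod = pd.DataFrame({"col": [1, 6]})

# Encoding each set independently yields different column sets.
train_X = pd.get_dummies(train["col"], prefix="col", dtype=int)  # shape (5, 5)
test_X = pd.get_dummies(test["col"], prefix="col", dtype=int)    # shape (4, 4)

# Pin the columns to those learned from the training data only.
# Missing levels are filled with 0; unseen levels are dropped.
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
prod_X = pd.get_dummies(prod["col"], prefix="col", dtype=int)
prod_aligned = prod_X.reindex(columns=train_X.columns, fill_value=0)

# The row for the unseen level 6 carries no information at all.
unseen_row_sum = int(prod_aligned.iloc[1].sum())
```

The dimensions now match the trained weights, but the model has no way to tell an unseen level apart from "none of the known levels", which is why the options below are worth considering.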

In reality, some of the more viable options are:

Retrain your model periodically to account for new data.
Do not use one-hot encoding. Seriously, there are many better options, such as leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers, like CatBoost, even have a built-in mechanism for encoding categorical features, and mature Python libraries such as category_encoders offer many other options.
Embed the categorical features, which could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
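As an illustration of the target-encoding idea, here is a rough sketch in plain pandas. The function names, the `smoothing` parameter, and the toy data are all assumptions made for this example, not the API of any particular library. Each level is replaced by a smoothed mean of the target, and a level never seen in training falls back to the global mean, so the feature stays a single column no matter which levels show up.

```python
import pandas as pd

def fit_target_encoding(categories, target, smoothing=10.0):
    """Learn a smoothed mean-target value per level seen in training."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    # Shrink levels with few observations toward the global mean.
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoded, global_mean

def apply_target_encoding(categories, encoded, global_mean):
    """Map levels to learned values; unseen levels get the global mean."""
    return categories.map(encoded).fillna(global_mean)

# Hypothetical toy data: level 6 never occurs in training.
train = pd.DataFrame({"col": [1, 1, 2, 2, 3], "y": [1, 0, 1, 1, 0]})
new_data = pd.Series([1, 3, 6])

encoded, global_mean = fit_target_encoding(train["col"], train["y"], smoothing=1.0)
new_encoded = apply_target_encoding(new_data, encoded, global_mean)
```

This sketch ignores target leakage on the training rows, which leave-one-out or cross-fitted variants are designed to reduce, so a maintained library implementation is probably the better choice for real use.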

answered Sep 24 '22 10:09

Vast Academician

Tags:

pandas

machine-learning

numpy

one-hot-encoding
