
How do I handle one-hot encoding if my test data is missing some category values in a column?

For example, if a column in my training data has the categorical values (1, 2, 3, 4, 5), one-hot encoding will give me 5 columns. But in the test data I have, say, only 4 of the 5 values, i.e. (1, 3, 4, 5), so one-hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error because the number of columns does not match between the train and test data: dim(4) != dim(5). Any suggestions on what to do about the missing column values?

[image: screenshot of the asker's code, not available]
Nikhil Mishra asked Nov 23 '17 17:11




1 Answer

Please don't make this mistake!

Yes, you can use the hack of concatenating the train and test sets before encoding and fool yourself, but the real problem shows up in production. There your model will someday face an unknown level of your categorical variable and break.
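To make the failure mode concrete, here is a minimal sketch in plain Python. The helper names `fit_levels` and `one_hot` are made up for illustration and do not come from any library. The levels are learned from the training data only, so the encoded width stays at 5 even when the test set contains only 4 of the values, and a level never seen in training raises an error, which is the situation production will eventually produce:

```python
def fit_levels(train_values):
    """Learn the sorted list of category levels from the training data only."""
    return sorted(set(train_values))


def one_hot(value, levels):
    """Encode one value against a fixed list of levels (hypothetical helper).

    Raises ValueError for a level that was not seen during training.
    """
    if value not in levels:
        raise ValueError(f"unknown category level: {value!r}")
    return [1 if value == level else 0 for level in levels]


train = [1, 2, 3, 4, 5]
test = [1, 3, 4, 5]  # level 2 is absent from the test set

levels = fit_levels(train)
encoded_test = [one_hot(v, levels) for v in test]

# The width comes from the training levels, not from the test data.
print(len(encoded_test[0]))

# A level that appears only in production breaks the encoder.
try:
    one_hot(6, levels)
except ValueError as err:
    print(err)
```

Fixing the levels at training time solves the dimension mismatch from the question, but it does not by itself say what to do with a genuinely new level.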

In practice, some of the more viable options are:

  1. Retrain your model periodically to account for new data.
  2. Do not use one-hot encoding. Seriously, there are many better options, such as leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers, like CatBoost, even have a built-in mechanism for encoding, and there are mature libraries like target_encoders in Python where you will find lots of other options.
  3. Embed the categorical features; this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
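As a rough illustration of option 2, the sketch below implements a simple smoothed target encoding in plain Python. The function names, the `smoothing` parameter, and the toy data are assumptions made for this example, not the API of any particular library. Each level is replaced by a smoothed mean of the target over the training rows, and a level unseen in training falls back to the global mean instead of raising an error:

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=1.0):
    """Return (per-level encoding, global mean) learned from training data.

    Each level is encoded as a smoothed mean of its targets:
    (sum + smoothing * global_mean) / (count + smoothing).
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def transform(categories, encoding, global_mean):
    """Encode categories; levels unseen in training fall back to the global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Toy training data, invented for this example.
train_cats = [1, 1, 2, 2, 3]
train_y = [1.0, 0.0, 1.0, 1.0, 0.0]
encoding, global_mean = fit_target_encoding(train_cats, train_y)

# Level 6 never occurred in training, so it is encoded as the global mean.
print(transform([1, 3, 6], encoding, global_mean))
```

The output is always a single numeric column, so train and test dimensions cannot diverge. Because the encoding is built from the target, it is usually computed out-of-fold on the training set to limit leakage; that step is left out here for brevity.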
Vast Academician answered Sep 24 '22 10:09