
Do I have to do one-hot-encoding separately for train and test dataset? [closed]

I'm working on a classification problem and I've split my data into train and test set.

I have a few categorical columns (around 4-6) and I'm thinking of using pd.get_dummies to convert my categorical values to one-hot encoding.

My question is: do I have to do the one-hot encoding separately for the train and test splits? If so, I guess I'd better use sklearn's OneHotEncoder, because it supports fit and transform methods, right?

asked Apr 04 '19 by user_6396

People also ask

Should encoding be done before or after the train-test split?

In the 'Categorical Variables' exercise in the Intermediate Machine Learning course, the label encoding of categorical variables is performed after the train-test split of the dataset. This creates a situation where, if the test data contains values that don't also appear in the training data, the encoder will throw an error, because it never saw those values while fitting.
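For example, here's a minimal sketch of that error, assuming scikit-learn 0.20+ (where OneHotEncoder accepts string categories); the category values are made up for illustration:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()        # default handle_unknown='error'
enc.fit([['A'], ['B']])      # fit on the training categories only
enc.transform([['C']])       # raises ValueError: unseen category 'C'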

When should one-hot encoding be avoided?

When the categorical features present in the dataset are ordinal, i.e. the values carry an order such as Junior, Senior, Executive, Owner. Also when the number of categories in the dataset is quite large: one-hot encoding should be avoided in that case, as it can lead to high memory consumption.
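For the ordinal case, a hedged sketch of the alternative using scikit-learn's OrdinalEncoder with an explicit category order (the seniority levels are just the example values above):

from sklearn.preprocessing import OrdinalEncoder

# explicit order: Junior < Senior < Executive < Owner
enc = OrdinalEncoder(categories=[['Junior', 'Senior', 'Executive', 'Owner']])
enc.fit_transform([['Senior'], ['Junior'], ['Owner']])
#array([[1.],
#       [0.],
#       [3.]])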

What challenges might one face by applying one-hot encoding to a categorical variable of the train dataset?

A) Not all categories of the categorical variable are present in the test dataset. B) The frequency distribution of categories differs between the train and test datasets.

What is one-hot encoding? Why and when do you have to use it?

But what is one-hot encoding, and why do we use it? Most machine learning tutorials and tools require you to prepare data before it can be fit to a particular ML model. One-hot encoding is the process of converting categorical variables into binary indicator columns so they can be provided to machine learning algorithms to improve predictions.
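As a minimal illustration with pd.get_dummies (the 'color' column here is hypothetical; recent pandas versions return boolean dummies rather than 0/1):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red']})
pd.get_dummies(df, columns=['color'])
#   color_green  color_red
#0            0          1
#1            1          0
#2            0          1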


1 Answer

Generally, you want to treat the test set as though you did not have it during training. Whatever transformations you apply to the train set should be applied to the test set before you make predictions. So yes, you transform each set separately, but you must apply the same fitted transformation to both.

For example, if the test set is missing one of the categories, there should still be a dummy variable for the missing category (which would be found in the training set), since the model you train will still expect that dummy variable. If the test set has an extra category, this should probably be handled with some "other" category.

Similarly, when scaling continuous variables, say to [0, 1], you use the range of the train set when scaling the test set. This can mean that a newly scaled test value falls outside of [0, 1].
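Here's a quick sketch of that scaling point (not part of the original answer), using scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(np.array([[0.], [10.]]))        # train range is [0, 10]
scaler.transform(np.array([[5.], [12.]]))  # 12 lies outside the train range
#array([[0.5],
#       [1.2]])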


For completeness, here's how the one-hot encoding might look:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### Correct
train = pd.DataFrame(['A', 'B', 'A', 'C'])
test = pd.DataFrame(['B', 'A', 'D'])

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)  # fit on the training data only

enc.transform(train).toarray()
#array([[1., 0., 0.],
#       [0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 1.]])

enc.transform(test).toarray()  # the unseen category 'D' becomes all zeros
#array([[0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 0.]])


### Incorrect
full = pd.concat((train, test))

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(full)  # fitting on the combined data leaks test-set categories

enc.transform(train).toarray()
#array([[1., 0., 0., 0.],
#       [0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 1., 0.]])

enc.transform(test).toarray()
#array([[0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 0., 1.]])

Notice that in the incorrect approach there is an extra column for D (which only shows up in the test set). During training we wouldn't know about D at all, so there shouldn't be a column for it.
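If you'd rather stay with pd.get_dummies (as in the question), a common workaround, sketched here rather than taken from the original answer, is to align the test columns to the train columns with reindex:

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# keep exactly the training columns: drop extras (like D), add missing ones as 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)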

answered Nov 15 '22 by mickey