
Do I have to do one-hot-encoding separately for train and test dataset? [closed]

I'm working on a classification problem and I've split my data into train and test set.

I have a few categorical columns (around 4-6) and I'm thinking of using pd.get_dummies to convert my categorical values to one-hot encoding.

My question is: do I have to do the one-hot encoding separately for the train and test splits? If so, I guess I'd better use sklearn's OneHotEncoder, because it supports fit and transform methods, right?

asked Apr 04 '19 by user_6396

People also ask

Should encoding be done before or after the train-test split?

In the 'Categorical Variables' exercise in the Intermediate Machine Learning course, the label encoding of categorical variables is performed after the train-test split of the dataset. This creates a situation where, if the test data contains values that don't also appear in the training data, the encoder will throw an error, because it never saw those values while fitting.
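For example, here's a minimal sketch of that error, assuming scikit-learn 0.20+ (where OneHotEncoder accepts string categories); the category values are made up for illustration:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()        # default handle_unknown='error'
enc.fit([['A'], ['B']])      # fit on the training categories only
enc.transform([['C']])       # raises ValueError: unseen category 'C'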

When should one-hot encoding be avoided?

When the categorical features present in the dataset are ordinal, i.e. the values carry an order such as Junior, Senior, Executive, Owner. Also when the number of categories in the dataset is quite large: one-hot encoding should be avoided in that case, as it can lead to high memory consumption.
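For the ordinal case, a hedged sketch of the alternative using scikit-learn's OrdinalEncoder with an explicit category order (the seniority levels are just the example values above):

from sklearn.preprocessing import OrdinalEncoder

# explicit order: Junior < Senior < Executive < Owner
enc = OrdinalEncoder(categories=[['Junior', 'Senior', 'Executive', 'Owner']])
enc.fit_transform([['Senior'], ['Junior'], ['Owner']])
#array([[1.],
#       [0.],
#       [3.]])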

What challenges might one face by applying one-hot encoding to a categorical variable of the train dataset?

A) Not all categories of the categorical variable are present in the test dataset. B) The frequency distribution of categories differs between the train and test datasets.

What is one-hot encoding? Why and when do you have to use it?

But what is one-hot encoding, and why do we use it? Most machine learning tutorials and tools require you to prepare data before it can be fit to a particular ML model. One-hot encoding is the process of converting categorical variables into binary indicator columns so they can be provided to machine learning algorithms to improve predictions.
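As a minimal illustration with pd.get_dummies (the 'color' column here is hypothetical; recent pandas versions return boolean dummies rather than 0/1):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red']})
pd.get_dummies(df, columns=['color'])
#   color_green  color_red
#0            0          1
#1            1          0
#2            0          1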


1 Answer

Generally, you want to treat the test set as though you did not have it during training. Whatever transformations you apply to the train set should be applied to the test set before you make predictions. So yes, you transform each set separately, but you must apply the same fitted transformation to both.

For example, if the test set is missing one of the categories, there should still be a dummy variable for the missing category (which would be found in the training set), since the model you train will still expect that dummy variable. If the test set has an extra category, this should probably be handled with some "other" category.

Similarly, when scaling continuous variables, say to [0, 1], you use the range of the train set when scaling the test set. This can mean that a newly scaled test value falls outside of [0, 1].
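Here's a quick sketch of that scaling point (not part of the original answer), using scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(np.array([[0.], [10.]]))        # train range is [0, 10]
scaler.transform(np.array([[5.], [12.]]))  # 12 lies outside the train range
#array([[0.5],
#       [1.2]])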


For completeness, here's how the one-hot encoding might look:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### Correct
train = pd.DataFrame(['A', 'B', 'A', 'C'])
test = pd.DataFrame(['B', 'A', 'D'])

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)  # fit on the training data only

enc.transform(train).toarray()
#array([[1., 0., 0.],
#       [0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 1.]])

enc.transform(test).toarray()  # the unseen category 'D' becomes all zeros
#array([[0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 0.]])


### Incorrect
full = pd.concat((train, test))

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(full)  # fitting on the combined data leaks test-set categories

enc.transform(train).toarray()
#array([[1., 0., 0., 0.],
#       [0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 1., 0.]])

enc.transform(test).toarray()
#array([[0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 0., 1.]])

Notice that in the incorrect approach there is an extra column for D (which only shows up in the test set). During training we wouldn't know about D at all, so there shouldn't be a column for it.
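If you'd rather stay with pd.get_dummies (as in the question), a common workaround, sketched here rather than taken from the original answer, is to align the test columns to the train columns with reindex:

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# keep exactly the training columns: drop extras (like D), add missing ones as 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)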

answered Nov 15 '22 by mickey