I have a dataset that includes over 100 countries. I want to use the country as a feature in an XGBoost model to make a classification prediction. I know that one-hot encoding is the go-to process for this, but I would rather do something that won't increase the dimensionality so much and will be resilient to new values, so I'm trying binary encoding using the category_encoders package: http://contrib.scikit-learn.org/categorical-encoding/binary.html
Using this encoding helped my model out over using basic one-hot encoding, but how do I get back to the original labels after encoding?
I know about the inverse_transform method, but that operates on the whole data frame. I need a way to pass in a single binary or integer value and get back the original value.
Here's some example data taken from: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
import numpy as np
import pandas as pd
import category_encoders as ce
# make some data
df = pd.DataFrame({
'color':["a", "c", "a", "a", "b", "b"],
'outcome':[1, 2, 3, 2, 2, 2]})
# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)
# instantiate an encoder - here we use Binary()
ce_binary = ce.BinaryEncoder(cols = ['color'])
# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)
I'd like to pass the values [0,0,1] or 1 into a function and get back "a" as a value.
The main reason for this is to look at the feature importances of the model. I can get feature importances per column, but that gives me back a column id rather than the underlying category value that is most important.
Please note that the article you reference suggests using the Binary Encoder for ordinal data only - that is, discrete data that has an order associated with it (small, medium, large), not nominal data (Red, White, Blue).
If you decide to use a Binary encoder, the order in which colors (or countries) are encoded will impact your performance. For example, assume red=001, white=010, and blue=011. When you apply an ML algorithm, it will see that red and blue have a feature in common (feature 3). This is probably not what you want.
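To make the shared-bit effect concrete, here is a minimal sketch using the hypothetical codes from the example above (red=001, white=010, blue=011). The helper name shared_active_bits is made up for illustration and is not part of category_encoders; the codes are written by hand rather than produced by the library.

```python
from itertools import combinations

# Hypothetical binary codes from the example above (not library output).
codes = {"red": (0, 0, 1), "white": (0, 1, 0), "blue": (0, 1, 1)}


def shared_active_bits(a, b):
    """Return the 1-based positions where both codes have a 1."""
    return [pos for pos, (x, y) in enumerate(zip(a, b), start=1)
            if x == 1 and y == 1]


# Compare every pair of categories and record which features they share.
overlaps = {
    (first, second): shared_active_bits(codes[first], codes[second])
    for first, second in combinations(codes, 2)
}

for pair, bits in overlaps.items():
    print(pair, bits)
```

Red and blue share feature 3, and white and blue share feature 2, even though nothing about the colors themselves justifies that grouping. A tree model can split on those columns and treat the paired categories alike.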
In terms of applying the inverse transformation, you'll need to apply it to [0,0,1] in your example above, not "1". "1" is meaningless without context, because a single column holds only one bit of the code. You should be able to apply the inverse transformation to a single record (row) in your data, but not to a single column. The inverse transform needs to operate on an object with the same output dimension as the encoder.
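If you want a standalone lookup rather than calling inverse_transform, one option is to keep your own mapping from codes to labels. The sketch below is plain Python and does not call category_encoders. It assumes that categories are numbered from 1 in order of first appearance and that bits are written most significant first; that is an assumption about the encoder's layout, so verify it against your fitted encoder (for example by transforming the unique values and inspecting the result). The names build_ordinal_lookup and decode are hypothetical.

```python
import numbers


def build_ordinal_lookup(values):
    """Map ordinal -> category, numbering from 1 in order of first appearance.

    ASSUMPTION: this mirrors how the binary codes were assigned; check it
    against your own fitted encoder before relying on it.
    """
    categories = list(dict.fromkeys(values))  # de-duplicate, keep order
    return {i: cat for i, cat in enumerate(categories, start=1)}


def decode(code, lookup):
    """Return the category for an integer ordinal or a full row of bits.

    A bit sequence is read most significant bit first, so leading zeros
    do not matter. Unknown codes return None.
    """
    if isinstance(code, numbers.Integral):
        ordinal = int(code)
    else:
        ordinal = int("".join(str(int(bit)) for bit in code), 2)
    return lookup.get(ordinal)


colors = ["a", "c", "a", "a", "b", "b"]
lookup = build_ordinal_lookup(colors)

print(decode([0, 0, 1], lookup))  # full row of bits
print(decode(1, lookup))          # integer ordinal
```

Note that the integer here is the ordinal for the whole row, not the value of one encoded column. For feature importances, a single binary column is shared by every category whose code has a 1 in that position, so an important column points to a group of categories rather than to one label.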