
How to get the original value for binary encoding using the category_encoders package

I have a dataset which includes over 100 countries. I want to include these in an XGBoost model to make a classification prediction. I know that one-hot encoding is the go-to process for this, but I would rather do something that won't increase the dimensionality so much and will be resilient to new values, so I'm trying binary encoding using the category_encoders package. http://contrib.scikit-learn.org/categorical-encoding/binary.html

Using this encoding helped my model out over using basic one-hot encoding, but how do I get back to the original labels after encoding?

I know about the inverse_transform method, but that operates on the whole data frame. I need a way to pass in a binary or integer value and get back the original value.

Here's some example data taken from: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

import numpy as np
import pandas as pd
import category_encoders as ce

# make some data
df = pd.DataFrame({
 'color':["a", "c", "a", "a", "b", "b"], 
 'outcome':[1, 2, 3, 2, 2, 2]})

# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)

# instantiate an encoder - here we use Binary()
ce_binary = ce.BinaryEncoder(cols = ['color'])

# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)


I'd like to pass the values [0,0,1] or 1 into a function and get back "a" as the value.

The main reason for this is for looking at the feature importances of the model. I can get feature importances based on a column, but this will give me back a column id rather than the underlying value of a category that is the most important.

cburton asked Nov 07 '22 17:11


1 Answer

Please note that the article you reference suggests using the Binary Encoder for ordinal data only - that is, discrete data that has an order associated with it (small, medium, large), not nominal data (Red, White, Blue).

If you decide to use a Binary encoder, the order in which colors (or countries) are encoded will impact your performance. For example, assume red=001, white=010, and blue=011. When you apply an ML algorithm, it will see that red and blue have a feature in common (feature 3). This is probably not what you want.
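To make the overlap concrete, here is a small sketch in plain Python that checks which pairs of codes share an active bit. The codes are the illustrative ones from the paragraph above, not the output of a fitted encoder, and the helper name shared_active_bits is made up for this example.

```python
from itertools import combinations

# Illustrative codes from the example above (not produced by a real encoder).
codes = {
    "red": (0, 0, 1),
    "white": (0, 1, 0),
    "blue": (0, 1, 1),
}


def shared_active_bits(a, b):
    """Return the 1-based positions (left to right) where both codes have a 1."""
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x == 1 and y == 1]


# Compare every pair of categories.
overlaps = {
    (first, second): shared_active_bits(codes[first], codes[second])
    for first, second in combinations(codes, 2)
}

for pair, positions in overlaps.items():
    print(pair, positions)
```

Red and blue overlap on feature 3, and white and blue overlap on feature 2, even though nothing about the colors themselves suggests those pairings. A different assignment order would produce different overlaps.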

To apply the inverse transformation, you'll need to pass [0,0,1] from your example above, not "1"; "1" is meaningless without context. You should be able to apply the inverse transformation to a single record (row) in your data, but not to a single column. The inverse transform needs to operate on an object with the output dimension of the transformer.
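If you want a standalone lookup rather than a call to inverse_transform on a one-row frame, one option is to rebuild the mapping yourself. The following is only a sketch in plain Python and does not call category_encoders. It assumes that categories are numbered 1, 2, ... in order of first appearance and that this number is written in binary across the encoded columns, most significant bit first. That matches [0,0,1] for "a" in the question, but you should check it against the output of your fitted encoder. The names build_lookup, bits_to_int and decode are hypothetical helpers.

```python
import numbers


def build_lookup(values):
    """Map ordinal -> category, numbering categories from 1 in order of
    first appearance (an assumption about how the encoder numbers them)."""
    ordinals = {}
    for value in values:
        if value not in ordinals:
            ordinals[value] = len(ordinals) + 1
    return {ordinal: value for value, ordinal in ordinals.items()}


def bits_to_int(bits):
    """Read a sequence of 0/1 values as a binary number, most significant bit first."""
    number = 0
    for bit in bits:
        number = number * 2 + int(bit)
    return number


def decode(lookup, key):
    """Accept either an integer ordinal (1) or a full row of bits ([0, 0, 1]).
    Returns None when the code does not correspond to a known category."""
    if not isinstance(key, numbers.Integral):
        key = bits_to_int(key)
    return lookup.get(int(key))


# Same column as in the question.
lookup = build_lookup(["a", "c", "a", "a", "b", "b"])
print(decode(lookup, [0, 0, 1]))
print(decode(lookup, 1))
```

Note that the integer accepted here is the ordinal for the whole row, not the value of one encoded column. A single column on its own still cannot be mapped back to a category.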

Jeff answered Nov 14 '22 22:11