
When using Categorical Data in xgboost, how do I maintain the implied encoding?

I'm following this tutorial for using categorical data in xgboost: https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html

I define some toy data here, where column "a" is the category and its values range from 10 to 19:

import numpy as np
import pandas as pd
import xgboost

# Define some toy data and specify "a" as a category
df = pd.DataFrame({
    "a": np.hstack([np.random.randint(10, 17, 50), np.random.randint(12, 20, 50)]),
    "b": np.random.normal(0., 4., 100),
    "c": np.hstack([np.random.normal(-2., 4., 50), np.random.normal(2., 4., 50)]),
    "d": np.hstack([np.zeros(50), np.ones(50)])
})
df["a"] = df["a"].astype("category")

I train a model and visualize the first tree. Everything works correctly, but the first split refers to categories "0, 1, 2, 3, 4, 5, 6", implying that there is some mapping between the categories defined above and the categories as xgboost understands them.

# Train a model using the native xgboost interface
dtrain = xgboost.DMatrix(df[["a", "b", "c"]], df["d"], enable_categorical=True)
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_cat_to_onehot": 5,
    "tree_method": "hist",
}
native_model = xgboost.train(params, dtrain, num_boost_round=10)

[Image: first tree split]
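The mapping can be inspected on the pandas side before the data ever reaches xgboost: for a categorical column, `cat.categories` holds the distinct values and `cat.codes` holds the integer code assigned to each row. A minimal sketch of that pandas behavior (not anything xgboost-specific):

```python
import numpy as np
import pandas as pd

# A categorical column built from the values 10..19 is assigned
# integer codes 0..9, in sorted order of the observed categories.
a = pd.Series(np.arange(10, 20)).astype("category")

print(list(a.cat.categories))  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(list(a.cat.codes))       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

These codes, not the raw values, are what the tree splits refer to.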

When I try to predict on new data, I don't know how to tell xgboost about the category mapping it inferred when the model was trained.

df.loc[0]
# a    12.000000
# b    -3.384966
# c    -4.169564
# d     0.000000
# Name: 0, dtype: float64

native_model.predict(dtrain)[0]
# 0.08855637

The prediction on the first data point seems reasonable enough.

df_predict = pd.DataFrame([{"a": 12, "b": -3.384966, "c": -4.169564}])
dpredict = xgboost.DMatrix(df_predict, feature_types=["c", "q", "q"], enable_categorical=True)
native_model.predict(dpredict)[0]
# 0.8009308 whereas I want it to match the above 0.08855637

Presumably, the prediction doesn't match because xgboost interprets the 12 as a non-existent category. The mapping doesn't seem to be saved off in the xgboost model json, so I can't tell xgboost which internal category the 12 refers to.

Is the only solution managing my own encoding and making sure my categorical variables are between [0, ncategories - 1] before creating the training DMatrix?

asked Nov 28 '25 by mrphilroth

1 Answer

Is the only solution managing my own encoding?

Yes. The XGBoost library assumes that category mappings are managed by the application, both in the training phase and in the testing/deployment phase.
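As an illustration of managing the encoding yourself, you can fix the category domain up front with a `pd.CategoricalDtype` and apply it to every frame, so training and prediction data share one value-to-code mapping. A minimal sketch; the hard-coded category list is an assumption for this toy data:

```python
import pandas as pd

# Fix the category domain once; "a" is known to range over 10..19.
A_DTYPE = pd.CategoricalDtype(categories=list(range(10, 20)))

train_a = pd.Series([12, 15, 19]).astype(A_DTYPE)
test_a = pd.Series([12]).astype(A_DTYPE)

# Both frames now share one mapping: 12 -> code 2 everywhere.
print(list(train_a.cat.codes))  # [2, 5, 9]
print(list(test_a.cat.codes))   # [2]
```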

You can tie multiple operations together by moving from the native Python Learning API to the Scikit-Learn API. The tricky part is implementing the "category" cast, but you can use the sklearn2pmml.preprocessing.CastTransformer class for that.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn2pmml.preprocessing import CastTransformer
from xgboost import XGBClassifier

pipeline = Pipeline([
    ("mapper", ColumnTransformer([
        ("cat_a", CastTransformer(dtype = "category"), ["a"]),
    ], remainder = "passthrough")),
    ("classifier", XGBClassifier(tree_method = "hist", enable_categorical = True))
])
pipeline.fit(df[["a", "b", "c"]], df["d"])

print(pipeline._final_estimator)
answered Nov 30 '25 by user1808924


