
When using Categorical Data in xgboost, how do I maintain the implied encoding?

I'm following this tutorial for using categorical data in xgboost: https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html

I define some toy data here, where column "a" is the category and its values range from 10 to 19:

import numpy as np
import pandas as pd
import xgboost

# Define some toy data and specify "a" as a category
df = pd.DataFrame({
    "a": np.hstack([np.random.randint(10, 17, 50), np.random.randint(12, 20, 50)]),
    "b": np.random.normal(0., 4., 100),
    "c": np.hstack([np.random.normal(-2., 4., 50), np.random.normal(2., 4., 50)]),
    "d": np.hstack([np.zeros(50), np.ones(50)])
})
df["a"] = df["a"].astype("category")

I train a model and visualize the first tree. Everything works correctly, but the first split refers to categories "0, 1, 2, 3, 4, 5, 6", implying that there is some mapping between the categories defined above and the categories as xgboost understands them.

# Train a model using the native xgboost interface
dtrain = xgboost.DMatrix(df[["a", "b", "c"]], df["d"], enable_categorical=True)
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_cat_to_onehot": 5,
    "tree_method": "hist",
}
native_model = xgboost.train(params, dtrain, num_boost_round=10)

[Image: first tree split]
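The mapping can be inspected on the pandas side before the data ever reaches xgboost: for a categorical column, `cat.categories` holds the distinct values and `cat.codes` holds the integer code assigned to each row. A minimal sketch of that pandas behavior (not anything xgboost-specific):

```python
import numpy as np
import pandas as pd

# A categorical column built from the values 10..19 is assigned
# integer codes 0..9, in sorted order of the observed categories.
a = pd.Series(np.arange(10, 20)).astype("category")

print(list(a.cat.categories))  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(list(a.cat.codes))       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

These codes, not the raw values, are what the tree splits refer to.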

When I try to predict on new data, I don't know how to tell xgboost about the category mapping it inferred when the model was trained.

df.loc[0]
# a    12.000000
# b    -3.384966
# c    -4.169564
# d     0.000000
# Name: 0, dtype: float64

native_model.predict(dtrain)[0]
# 0.08855637

The prediction on the first data point seems reasonable enough.

df_predict = pd.DataFrame([{"a": 12, "b": -3.384966, "c": -4.169564}])
dpredict = xgboost.DMatrix(df_predict, feature_types=["c", "q", "q"], enable_categorical=True)
native_model.predict(dpredict)[0]
# 0.8009308 whereas I want it to match the above 0.08855637

Presumably, the prediction doesn't match because xgboost interprets the 12 as a non-existent category. The mapping doesn't seem to be saved off in the xgboost model json, so I can't tell xgboost which internal category the 12 refers to.

Is the only solution managing my own encoding and making sure my categorical variables are between [0, ncategories - 1] before creating the training DMatrix?

asked Nov 28 '25 by mrphilroth

1 Answer

Is the only solution managing my own encoding?

Yes. The XGBoost library assumes that category mappings are managed by the application, both in the training phase and in the testing/deployment phase.
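As an illustration of managing the encoding yourself, you can fix the category domain up front with a `pd.CategoricalDtype` and apply it to every frame, so training and prediction data share one value-to-code mapping. A minimal sketch; the hard-coded category list is an assumption for this toy data:

```python
import pandas as pd

# Fix the category domain once; "a" is known to range over 10..19.
A_DTYPE = pd.CategoricalDtype(categories=list(range(10, 20)))

train_a = pd.Series([12, 15, 19]).astype(A_DTYPE)
test_a = pd.Series([12]).astype(A_DTYPE)

# Both frames now share one mapping: 12 -> code 2 everywhere.
print(list(train_a.cat.codes))  # [2, 5, 9]
print(list(test_a.cat.codes))   # [2]
```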

You can tie multiple operations together by moving from the native Python Learning API to the Scikit-Learn API. The tricky part is implementing the "category" cast, but you can use the sklearn2pmml.preprocessing.CastTransformer class for that.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn2pmml.preprocessing import CastTransformer
from xgboost import XGBClassifier

pipeline = Pipeline([
    ("mapper", ColumnTransformer([
        ("cat_a", CastTransformer(dtype = "category"), ["a"]),
    ], remainder = "passthrough")),
    ("classifier", XGBClassifier(tree_method = "hist", enable_categorical = True))
])
pipeline.fit(df[["a", "b", "c"]], df["d"])

print(pipeline._final_estimator)
answered Nov 30 '25 by user1808924


