I'm following this tutorial for using categorical data in xgboost: https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html
I define some toy data here, where "a" is the categorical column and its values range from 10 to 19:
import numpy as np
import pandas as pd

# Define some toy data and specify "a" as a category
df = pd.DataFrame({
"a": np.hstack([np.random.randint(10, 17, 50), np.random.randint(12, 20, 50)]),
"b": np.random.normal(0., 4., 100),
"c": np.hstack([np.random.normal(-2., 4., 50), np.random.normal(2., 4., 50)]),
"d": np.hstack([np.zeros(50), np.ones(50)])
})
df["a"] = df["a"].astype("category")
I train a model and visualize the first tree. Everything is working correctly, but the first split refers to categories "0, 1, 2, 3, 4, 5, 6", implying that there's some mapping between the categories defined above and the categories as xgboost understands them.
import xgboost

# Train a model using the native xgboost interface
dtrain = xgboost.DMatrix(df[["a", "b", "c"]], df["d"], enable_categorical=True)
native_model = xgboost.train({"objective": "binary:logistic", "eval_metric": "auc", "max_cat_to_onehot": 5, "tree_method": "hist"}, dtrain, 10)
[figure: plot of the first tree's split]
When I try to predict on new data, I don't know how to tell xgboost what category mapping it inferred when it trained the model.
df.loc[0]
# a 12.000000
# b -3.384966
# c -4.169564
# d 0.000000
# Name: 0, dtype: float64
native_model.predict(dtrain)[0]
# 0.08855637
The prediction on the first data point seems reasonable enough.
df_predict = pd.DataFrame([{"a": 12, "b": -3.384966, "c": -4.169564}])
dpredict = xgboost.DMatrix(df_predict, feature_types=["c", "q", "q"], enable_categorical=True)
native_model.predict(dpredict)[0]
# 0.8009308, but I want it to match the 0.08855637 from above
Presumably, the prediction doesn't match because xgboost interprets the 12 as a non-existent category. The mapping doesn't seem to be saved in the xgboost model JSON, so I can't tell xgboost which internal category the 12 refers to.
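For what it's worth (this is my understanding, not something the linked tutorial spells out): with a pandas categorical column, xgboost trains on the integer codes that pandas assigns to the sorted distinct values, which you can inspect directly:

```python
import pandas as pd

# pandas assigns each distinct value an integer code, in sorted order;
# these codes are what xgboost sees internally
s = pd.Series([12, 15, 10, 19, 12]).astype("category")
print(dict(enumerate(s.cat.categories.tolist())))  # {0: 10, 1: 12, 2: 15, 3: 19}
print(s.cat.codes.tolist())                        # [1, 2, 0, 3, 1]
```

So when df_predict is built from plain integers with feature_types=["c", "q", "q"], the 12 is read as code 12 rather than being looked up as a value, which would explain the nonsense prediction.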
Is the only solution managing my own encoding and making sure my categorical variables are between [0, ncategories - 1] before creating the training DMatrix?
Is the only solution managing my own encoding?
Yes, the XGBoost library assumes that category mappings are managed by the application, both in the training phase and in the testing/deployment phase.
You can tie multiple operations together by moving from the native Python Learning API to the Scikit-Learn API. The tricky part is implementing the "category" cast inside the pipeline, but you can use sklearn2pmml.preprocessing.CastTransformer for that.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn2pmml.preprocessing import CastTransformer
from xgboost import XGBClassifier
pipeline = Pipeline([
("mapper", ColumnTransformer([
("cat_a", CastTransformer(dtype = "category"), ["a"]),
], remainder = "passthrough")),
("classifier", XGBClassifier(tree_method = "hist", enable_categorical = True))
])
pipeline.fit(df[["a", "b", "c"]], df["d"])
print(pipeline._final_estimator)