I am trying to train an XGBoost classifier. The target variable y is binary.
DATA. (Couldn't find a sample dataset to make this completely reproducible. Sorry about that.)
X_train, X_validate, X_test (contain numerical and categorical data)
y_train, y_validate, y_test (the values are binary 1/0).
PREPROCESSOR.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector as selector

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=-999))])

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('cat', categorical_transformer, selector(dtype_include="object")),
        ('num', numerical_transformer, selector(dtype_exclude="object"))
    ])
MODEL.
import xgboost as xgb

best_clf = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', xgb.XGBClassifier(
                               seed=42,
                               objective='binary:logistic',
                               missing=-999,
                               ## optimal params
                               learning_rate=0.1))])

best_clf.fit(X_train, y_train,
             classifier__early_stopping_rounds=10,
             classifier__eval_metric='aucpr',
             classifier__eval_set=[(X_validate_preprocessed, y_validate)],
             classifier__verbose=True)
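X_validate_preprocessed is not defined above; it is assumed to come from fitting the preprocessor on the training data and transforming the validation set, roughly like this:

# Assumed preparation of the eval_set: fit the preprocessor on the training
# data and apply it to the validation data, so the columns match what the
# classifier receives inside the pipeline.
X_validate_preprocessed = preprocessor.fit(X_train, y_train).transform(X_validate)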
Everything works fine so far. I now have a model. But I want to calibrate this model.
CALIBRATION.
I tried:
from sklearn.calibration import CalibratedClassifierCV

best_clf_calib = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('calibrator', CalibratedClassifierCV(
                                     base_estimator=best_clf.named_steps.classifier,
                                     cv='prefit',
                                     method='isotonic'))])

best_clf_calib.fit(X_validate, y_validate)
But it gives me the following error:
TypeError: predict_proba() got an unexpected keyword argument 'X'
Question: How specifically should I set the base_estimator parameter in CalibratedClassifierCV? I tried setting
base_estimator = best_clf
But in that case, it seems that the pipeline gets run twice. Here is a diagram of the pipeline steps.

You don't necessarily need to downgrade sklearn.
I believe that the problem comes from XGBoost. It's explained here: https://github.com/dmlc/xgboost/pull/6555
XGBoost defines:
predict_proba(self, data, ...)
instead of:
predict_proba(self, X, ...)
And since sklearn 0.24 calls clf.predict_proba(X=X), an exception is thrown.
Here is an idea to fix the problem without changing the versions of your packages: create a class that inherits from XGBClassifier, override predict_proba with the sklearn-style argument name, and call super().
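A minimal sketch of such a wrapper (the class name PatchedXGBClassifier is just an illustration, and it assumes the affected XGBoost version takes the prediction data as its first positional argument):

import xgboost as xgb

class PatchedXGBClassifier(xgb.XGBClassifier):
    # Expose the argument name `X` that sklearn >= 0.24 passes as a keyword,
    # then forward it positionally to XGBoost's own predict_proba (whose
    # first parameter is called `data` in the affected versions).
    def predict_proba(self, X, *args, **kwargs):
        return super().predict_proba(X, *args, **kwargs)

If you rebuild best_clf with PatchedXGBClassifier in place of xgb.XGBClassifier, the prefit classifier you hand to CalibratedClassifierCV then accepts predict_proba(X=...) and the TypeError goes away.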