I am trying to train an XGBoost classifier. The target variable y is binary.
DATA. (Couldn't find a sample dataset to make this completely reproducible. Sorry about that.)
X_train, X_validate, X_test (contain numerical and categorical data)
y_train, y_validate, y_test (the values are binary 1/0).
PREPROCESSOR.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector as selector

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=-999))])

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('cat', categorical_transformer, selector(dtype_include="object")),
        ('num', numerical_transformer, selector(dtype_exclude="object"))
    ])
MODEL.
import xgboost as xgb

best_clf = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', xgb.XGBClassifier(
                               seed=42,
                               objective='binary:logistic',
                               missing=-999,
                               ## optimal params
                               learning_rate=0.1))])

best_clf.fit(X_train, y_train,
             classifier__early_stopping_rounds=10,
             classifier__eval_metric='aucpr',
             classifier__eval_set=[(X_validate_preprocessed, y_validate)],
             classifier__verbose=True)
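X_validate_preprocessed is not defined above; it is assumed to come from fitting the preprocessor on the training data and transforming the validation set, roughly like this:

# Assumed preparation of the eval_set: fit the preprocessor on the training
# data and apply it to the validation data, so the columns match what the
# classifier receives inside the pipeline.
X_validate_preprocessed = preprocessor.fit(X_train, y_train).transform(X_validate)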
Everything works fine so far. I now have a model. But I want to calibrate this model.
CALIBRATION.
I tried:
from sklearn.calibration import CalibratedClassifierCV

best_clf_calib = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('calibrator', CalibratedClassifierCV(
                                     base_estimator=best_clf.named_steps.classifier,
                                     cv='prefit',
                                     method='isotonic'))])

best_clf_calib.fit(X_validate, y_validate)
But it gives me the following error:
TypeError: predict_proba() got an unexpected keyword argument 'X'
Question: How specifically should I set the base_estimator parameter in CalibratedClassifierCV? I tried setting
base_estimator = best_clf
But in that case, it seems that the pipeline gets run twice. Here is a diagram of the pipeline steps.

You don't necessarily need to downgrade sklearn.
I believe that the problem comes from XGBoost. It's explained here: https://github.com/dmlc/xgboost/pull/6555
XGBoost defines:
predict_proba(self, data, ...)
instead of:
predict_proba(self, X, ...)
And since sklearn 0.24 calls clf.predict_proba(X=X), an exception is thrown.
Here is an idea to fix the problem without changing the versions of your packages: create a class that inherits from XGBClassifier, override predict_proba with the sklearn-style argument name, and call super().
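A minimal sketch of such a wrapper (the class name PatchedXGBClassifier is just an illustration, and it assumes the affected XGBoost version takes the prediction data as its first positional argument):

import xgboost as xgb

class PatchedXGBClassifier(xgb.XGBClassifier):
    # Expose the argument name `X` that sklearn >= 0.24 passes as a keyword,
    # then forward it positionally to XGBoost's own predict_proba (whose
    # first parameter is called `data` in the affected versions).
    def predict_proba(self, X, *args, **kwargs):
        return super().predict_proba(X, *args, **kwargs)

If you rebuild best_clf with PatchedXGBClassifier in place of xgb.XGBClassifier, the prefit classifier you hand to CalibratedClassifierCV then accepts predict_proba(X=...) and the TypeError goes away.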