 

How to use RFE with xgboost Booster?

I'm currently using xgb.train(...), which returns a Booster, but I'd like to use RFE to select the best 100 features. The returned Booster cannot be used with RFE because it isn't a scikit-learn estimator. XGBClassifier is the scikit-learn API for the xgboost library; however, I'm not able to get the same results as with the xgb.train(...) method (about 10% worse on ROC-AUC). I've tried the scikit-learn boosters, but they aren't able to get similar results either. I've also tried wrapping the xgb.train(...) method in a class to add the scikit-learn estimator methods, but there are just too many to implement. Is there some way to use xgb.train(...) together with RFE from scikit-learn?

asked Feb 22 '21 by pmdaly


People also ask

Do you need feature selection for XGBoost?

XGBoost does feature selection for you: uninformative features are simply never chosen for splits. XGBoost does not do feature engineering or feature extraction for you, so you still have to do those yourself. Only a deep learning model could replace feature extraction for you.
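As a concrete illustration (a minimal sketch, not part of the original answer), XGBoost's built-in importances can drive scikit-learn's SelectFromModel to keep only the most informative columns:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Keep the features whose importance exceeds the mean importance
selector = SelectFromModel(XGBClassifier(n_estimators=100), threshold="mean")
selector.fit(X, y)
print(selector.transform(X).shape)  # fewer than 30 columns remain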

Can XGBoost handle ordinal data?

XGBoost may assume that encoded integer values for each input variable have an ordinal relationship. For example, that 'left-up' encoded as 0 and 'left-low' encoded as 1 for the breast-quad variable have a meaningful relationship as integers.
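A minimal sketch of the two encodings (the breast-quad values are the ones mentioned above; everything else is illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

breast_quad = np.array([["left-up"], ["left-low"], ["left-up"]])

# Ordinal encoding imposes an integer order the categories may not have
print(OrdinalEncoder().fit_transform(breast_quad))

# One-hot encoding avoids any implied order between categories
print(OneHotEncoder().fit_transform(breast_quad).toarray())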

Does XGBoost require preprocessing?

Strictly speaking, tree-based methods do not require explicit data standardisation. XGBoost with a tree base learner therefore does not require this kind of preprocessing.
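To see why, note that tree splits depend only on the ordering of feature values, so a monotonic transform like standardisation leaves the trees' partitions, and in practice the predictions, unchanged (an illustrative sketch, not from the original snippet):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_std = StandardScaler().fit_transform(X)

model_raw = XGBClassifier(n_estimators=50, random_state=0).fit(X, y)
model_std = XGBClassifier(n_estimators=50, random_state=0).fit(X_std, y)

# Scaling is monotonic per feature, so the trees split the data identically
print(np.allclose(model_raw.predict_proba(X), model_std.predict_proba(X_std)))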

What is XGBoost boosting?

Boosting is a machine learning technique that has been shown to produce models with high predictive accuracy. One of the most common ways to implement boosting in practice is to use XGBoost, short for "extreme gradient boosting."

What is the base learner of XGBoost model?

There is a technique called Gradient Boosted Trees, whose base learner is CART (Classification and Regression Trees). XGBoost is an implementation of gradient boosted decision trees, and XGBoost models frequently dominate Kaggle competitions.

How do I install XGBoost in Python?

XGBoost can be installed as a standalone library, and an XGBoost model can be developed using the scikit-learn API. The first step is to install the XGBoost library if it is not already installed. This can be achieved with the pip package manager on most platforms, for example:

sudo pip install xgboost
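Once installed, a quick sanity check is to import the library and print its version:

import xgboost
print(xgboost.__version__)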

How to use XGBoost with scikit-learn?

Scikit-Learn API: this is a scikit-learn wrapper interface for XGBoost. It lets you use XGBoost in a scikit-learn-compatible way, the same way you would use any native scikit-learn model. Note that with the native Learning API (xgb.train) you can pass in and monitor an evaluation metric during training, whereas with the scikit-learn API you typically have to calculate it yourself.
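A minimal sketch of the scikit-learn-style workflow (the synthetic dataset and hyperparameters here are just for illustration):

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit and predict exactly like any other scikit-learn estimator
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# The evaluation metric is computed by hand, as noted above
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))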


1 Answer

For this kind of problem, I created shap-hypetune: a Python package for simultaneous hyperparameter tuning and feature selection for gradient boosting models.

In your case, this enables you to perform RFE with XGBClassifier in a very simple and intuitive way:

from shaphypetune import BoostRFE
from xgboost import XGBClassifier

# Recursive feature elimination: drop `step` features per round
# until only `min_features_to_select` remain
model = BoostRFE(XGBClassifier(), min_features_to_select=1, step=1)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=6, verbose=0)

pred = model.predict(X_test)

As you can see, you can use all the fitting options available in the standard XGBoost API, such as early_stopping_rounds or custom metrics, to customize the training process.
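For example, a custom evaluation metric can be passed straight through fit (a sketch assuming the pre-1.6 xgboost behaviour, where eval_metric is accepted in fit and a callable metric is minimized; newer versions move this to the constructor):

from sklearn.metrics import roc_auc_score

# Custom metric in the classic sklearn-API style: it receives the
# predictions and a DMatrix, and returns a (name, value) pair
def neg_auc(y_predicted, dmatrix):
    return 'neg_auc', -roc_auc_score(dmatrix.get_label(), y_predicted)

model = BoostRFE(XGBClassifier(), min_features_to_select=1, step=1)
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          eval_metric=neg_auc,  # negated because custom metrics are minimized
          early_stopping_rounds=6,
          verbose=0)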

You can also use shap-hypetune to perform hyperparameter tuning (optionally at the same time as feature selection), or to run feature selection with RFE or Boruta using SHAP feature importances. A full example is available here.
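For instance, Boruta driven by SHAP importances looks like this (a sketch following the package's README; the parameter values are illustrative):

from shaphypetune import BoostBoruta
from xgboost import XGBClassifier

# Boruta selection using SHAP feature importances computed on the eval_set
model = BoostBoruta(XGBClassifier(), max_iter=100, perc=100,
                    importance_type='shap_importances', train_importance=False)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=6, verbose=0)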

answered Sep 28 '22 by Marco Cerliani