
How to get feature importance in xgboost?

Tags:

python

xgboost

I'm using xgboost to build a model, and I'm trying to find the importance of each feature using get_fscore(), but it returns {}.

My training code is:

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

So is there a mistake in my training code? How do I get the feature importance in xgboost?

asked Jun 04 '16 by modkzs

People also ask

How does XGBoost Find feature importance?

There are 3 ways to compute feature importance for XGBoost: built-in feature importance, permutation-based importance, and importance computed with SHAP values.
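As a rough sketch of the permutation-based option (not taken from any answer here; it assumes the scikit-learn wrapper XGBClassifier and synthetic placeholder data), you can use scikit-learn's permutation_importance:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier

# Placeholder data; substitute your own X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=4).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")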

How is feature importance calculated in gradient boosting?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
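As a toy illustration of that weighting (the numbers below are made up, purely to show the arithmetic), a single split's contribution is its impurity reduction weighted by the fraction of samples that reach the node:

# Hypothetical numbers for one split in one tree.
n_total = 1000                        # samples in the whole training set
n_node, node_impurity = 400, 0.48     # samples reaching this node, its impurity
n_left, left_impurity = 250, 0.30     # left child
n_right, right_impurity = 150, 0.20   # right child

# Weighted impurity decrease for this node.
node_importance = (n_node / n_total) * node_impurity \
    - (n_left / n_total) * left_impurity \
    - (n_right / n_total) * right_impurity
print(node_importance)  # 0.087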

Is XGBoost feature importance reliable?

XGBoost's feature importance tends to be more reliable than the methods mentioned above: it is far faster than Random Forests, and more dependable than linear models, so the feature importance it reports is usually more accurate.

Does feature selection help XGBoost?

One very useful function in XGBoost is plot_importance, which gives you the F-score of each feature, showing that feature's importance to the model. This is helpful for selecting features, not only for your XGBoost model but also for any other similar model you may run on the data.
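As a small sketch (not part of the original question), plotting those F-scores for a trained booster such as the bst from the question could look like this:

import matplotlib.pyplot as plt
import xgboost as xgb

# Plot per-feature F-scores ('weight' by default) for the trained booster.
xgb.plot_importance(bst, importance_type='weight', max_num_features=20)
plt.tight_layout()
plt.show()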


3 Answers

In your code you can get the feature importance for each feature as a dict:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: get_score(), the Booster method used with the train() API, is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str (optional)) – The name of feature map file.
  • importance_type
    • ‘weight’ - the number of times a feature is used to split the data across all trees.
    • ‘gain’ - the average gain across all splits the feature is used in.
    • ‘cover’ - the average coverage across all splits the feature is used in.
    • ‘total_gain’ - the total gain across all splits the feature is used in.
    • ‘total_cover’ - the total coverage across all splits the feature is used in.

https://xgboost.readthedocs.io/en/latest/python/python_api.html
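For example, a quick sketch (assuming the bst booster from the question, and an XGBoost version recent enough to support total_gain and total_cover) that compares all five importance types:

# Each call returns a dict mapping feature name -> score.
for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))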

answered Oct 09 '22 by MLKing


Get the table containing scores and feature names, and then plot it.

import pandas as pd
import matplotlib.pyplot as plt

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot top 40 features
plt.show()

For example:

(example output: a horizontal bar chart of the top features, ranked by score)

answered Oct 09 '22 by Catbuilts


Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
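For instance, a minimal sketch (using the Iris dataset purely as placeholder data) of why the DataFrame matters: the booster picks up the column names, so the returned dict uses them instead of f0, f1, ...:

import pandas as pd
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

# Fit on a DataFrame so the booster records the column names.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

clf = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
print(clf.get_booster().get_score(importance_type="gain"))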

answered Oct 09 '22 by Sesquipedalism