I'm using xgboost to build a model, and I'm trying to find the importance of each feature using get_fscore(), but it returns {}.
My training code is:
import xgboost as xgb

# X is the feature matrix, Y the label vector
dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
So is there any mistake in my training code? How do I get feature importance in xgboost?
There are 3 ways to compute feature importance in XGBoost: the built-in feature importance, permutation-based importance, and importance computed with SHAP values. The built-in importance is described next; a short sketch of the other two follows below.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
XGBoost's feature importance tends to be more useful than that of the models mentioned above: it trains far faster than Random Forests, and it is far more reliable than linear models, so the resulting importance scores are usually much more accurate.
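For completeness, here is a minimal sketch of the other two approaches, assuming a fitted sklearn-API model model and held-out data X_val, y_val (these names are hypothetical; shap is a separate package installed with `pip install shap`):

from sklearn.inspection import permutation_importance
import shap

# Permutation-based importance: shuffle one column at a time and
# measure how much the model's score drops.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
print(result.importances_mean)

# SHAP values: per-sample, per-feature contributions to the prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)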
One very handy function XGBoost provides is plot_importance, which shows the F-score of each feature, i.e. that feature's importance to the model. This is helpful for feature selection, not only for your XGBoost model but also for any similar model you may run on the data.
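For instance, a minimal sketch that plots the importances of the bst Booster trained in the question (assuming matplotlib is installed):

import matplotlib.pyplot as plt
import xgboost as xgb

# Bar chart of per-feature F-scores (split counts) for the trained Booster
xgb.plot_importance(bst)
plt.show()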
In your code you can get the importance of each feature in dict form:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
Explanation: The train() API's Booster method get_score() is defined as:
get_score(fmap='', importance_type='weight')
In recent versions of xgboost, valid importance_type values are 'weight', 'gain', 'cover', 'total_gain', and 'total_cover'.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
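A quick way to compare them, assuming a reasonably recent xgboost version and the bst Booster from above:

# Print every importance flavour side by side for the same model
for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))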
Get the table containing scores and feature names, and then plot it.
import pandas as pd

# model is a fitted sklearn-API estimator (e.g. XGBClassifier); grab its Booster
feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot top 40 features
Using the sklearn API and XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
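A minimal end-to-end sketch under that assumption (the diabetes dataset is just an arbitrary demo choice):

import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_diabetes

# Build a DataFrame so that column names survive into the Booster
raw = load_diabetes()
X = pd.DataFrame(raw.data, columns=raw.feature_names)
y = raw.target

regr = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.03)
regr.fit(X, y)

# Keys are now real column names instead of f0, f1, ...
print(regr.get_booster().get_score(importance_type="gain"))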