I want to understand how the feature importance in xgboost is calculated by 'gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:
‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying: if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).
In scikit-learn, feature importance is calculated from the Gini impurity / information gain reduction at each node that splits on a variable, i.e. the weighted impurity of the node minus the weighted impurity of the left child node minus the weighted impurity of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting).
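For context, here is a minimal sketch of that impurity-based calculation in scikit-learn; the dataset and the plain DecisionTreeClassifier are illustrative assumptions, not something from the question:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
importance = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node, no split here
        continue
    # weighted impurity of the node minus weighted impurities of its children
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importance[t.feature[node]] += decrease

importance /= importance.sum()
print(importance)                  # should match clf.feature_importances_ up to floating point
print(clf.feature_importances_)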
I wonder whether xgboost also uses this approach, using information gain or accuracy as stated in the citation above. I've tried to dig into the xgboost code and found this method (irrelevant parts already cut off):
def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)
    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue
            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')
            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])
            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]
            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g
    return gmap
So 'gain' is extracted from the text dump of each booster, but how is it actually measured?
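For reference, a minimal sketch of what that method is parsing (the dataset, parameters, and the example dump line are illustrative assumptions, not taken from the question):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic', 'max_depth': 3}, dtrain, num_boost_round=10)

# Each split line in the text dump carries per-split statistics, e.g. something like
# '0:[f22<105.95] yes=1,no=2,missing=1,gain=432.8,cover=142.2'
# get_score() simply parses the 'gain=' (or 'cover=') value out of every such line
# and accumulates it per feature.
print(bst.get_dump(with_stats=True)[0])

# In the version quoted above this returns the summed gain per feature; newer releases
# report 'gain' as the per-split average and expose 'total_gain' for the sum.
print(bst.get_score(importance_type='gain'))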
Nice question. The gain is calculated using this equation (it is the split-gain formula from the tutorial linked below):

Gain = 1/2 * [ G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - (G_L + G_R)^2 / (H_L + H_R + lambda) ] - gamma

where G_L and G_R are the sums of the first-order gradients of the loss over the instances in the left and right child, H_L and H_R are the corresponding sums of second-order gradients (hessians), lambda is the L2 regularization term, and gamma is the complexity cost of adding another leaf. Every time a split is made, this gain is recorded for the feature that was split on, and the 'gain' importance aggregates it over all splits that use the feature.
For a deep explanation read this: https://xgboost.readthedocs.io/en/latest/tutorials/model.html
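To make the formula concrete, here is a small sketch of the per-split gain computation; the gradient and hessian sums are made-up numbers for illustration, not values from a real model:

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Hypothetical gradient/hessian sums for the two children of a candidate split:
print(split_gain(G_L=-4.0, H_L=2.5, G_R=6.0, H_R=3.5))   # 6.0

The 'gain' feature importance then just aggregates these per-split values over every split in every tree that uses the feature, which is exactly the accumulation done in get_score() above.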