Feature importance 'gain' in XGBoost

I want to understand how feature importance in XGBoost is calculated for importance_type='gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:

‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

In scikit-learn, feature importance is calculated from the Gini impurity / information gain reduction of each node that splits on a variable, i.e. the weighted impurity of the node minus the weighted impurity of its left child minus the weighted impurity of its right child (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting).
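
For concreteness, here is a minimal sketch of that computation on a single fitted scikit-learn tree (the helper impurity_importance is my own, not the library's, but it mirrors what feature_importances_ computes):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def impurity_importance(fitted_tree):
    t = fitted_tree.tree_
    importances = np.zeros(t.n_features)
    n = t.weighted_n_node_samples
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, contributes nothing
            continue
        # weighted impurity decrease of the split at this node
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    importances /= n[0]  # normalize by the root's total sample weight
    return importances / importances.sum()  # scale to sum to 1

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(impurity_importance(clf))  # matches clf.feature_importances_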

I wonder whether xgboost also uses this approach, with information gain or accuracy as stated in the quote above. I've tried to dig into the xgboost code and found this method (irrelevant parts already cut out):

def get_score(self, fmap='', importance_type='gain'):
    # dump every tree as text, including per-split statistics (gain, cover)
    trees = self.get_dump(fmap, with_stats=True)

    # turn 'gain' into 'gain=' so it can be matched in the dump text below
    importance_type += '='
    fmap = {}  # feature -> number of splits that use it
    gmap = {}  # feature -> accumulated gain (or cover)
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue

            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')

            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])

            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]

            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g

    return gmap

So 'gain' is extracted from the text dump of each tree, but how is it actually calculated?
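
For reference, the same numbers are reachable through the public API. Assuming a trained xgboost.Booster called bst (my name, not from the snippet), a quick sketch:

# bst is a trained Booster, e.g. bst = xgb.train(params, dtrain)
# with with_stats=True, every split line carries its statistics, roughly:
#   0:[f2<1.5] yes=1,no=2,missing=1,gain=120.5,cover=80   (values illustrative)
print(bst.get_dump(with_stats=True)[0])

# per-feature totals parsed from those lines, as get_score does above
print(bst.get_score(importance_type='gain'))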

asked Aug 05 '19 by nellng

1 Answer

Nice question. The gain is calculated using this equation:

Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

where G_L and G_R are the sums of the first-order gradients of the loss over the instances in the left and right child, H_L and H_R are the corresponding sums of second-order gradients (Hessians), \lambda is the L2 regularization weight, and \gamma is the complexity penalty for adding a leaf.

For a deeper explanation, read this: https://xgboost.readthedocs.io/en/latest/tutorials/model.html
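
In code, the formula is a one-liner. A minimal numeric sketch (the function name and defaults are mine; lam and gamma stand for the regularization parameters lambda and gamma):

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # G_*, H_*: sums of first- and second-order gradients of the loss
    # over the training instances routed to the left / right child
    def leaf_score(G, H):
        return G * G / (H + lam)
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma

A split is only worth keeping if this gain is positive, i.e. if the improvement of the two children over the unsplit node outweighs the complexity penalty gamma. The get_score method in the question then simply sums these per-split gains over all splits that use a given feature.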

answered Oct 12 '22 by seralouk