I want to understand how the feature importance in xgboost is calculated by 'gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:
‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying: if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).
In scikit-learn, feature importance is calculated from the Gini impurity / information gain reduction at each node that splits on a variable, i.e. the weighted impurity of the node minus the weighted impurity of the left child node minus the weighted impurity of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting).
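For context, here is a minimal sketch of that impurity-based calculation in scikit-learn; the dataset and the plain DecisionTreeClassifier are illustrative assumptions, not something from the question:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
importance = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node, no split here
        continue
    # weighted impurity of the node minus weighted impurities of its children
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importance[t.feature[node]] += decrease

importance /= importance.sum()
print(importance)                  # should match clf.feature_importances_ up to floating point
print(clf.feature_importances_)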
I wonder whether xgboost also uses this approach, using information gain or accuracy as stated in the citation above. I've tried to dig into the xgboost code and found this method (irrelevant parts already cut off):
def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)
    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue
            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')
            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])
            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]
            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g
    return gmap
So 'gain' is extracted from the text dump of each booster, but how is it actually measured?
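For reference, a minimal sketch of what that method is parsing (the dataset, parameters, and the example dump line are illustrative assumptions, not taken from the question):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic', 'max_depth': 3}, dtrain, num_boost_round=10)

# Each split line in the text dump carries per-split statistics, e.g. something like
# '0:[f22<105.95] yes=1,no=2,missing=1,gain=432.8,cover=142.2'
# get_score() simply parses the 'gain=' (or 'cover=') value out of every such line
# and accumulates it per feature.
print(bst.get_dump(with_stats=True)[0])

# In the version quoted above this returns the summed gain per feature; newer releases
# report 'gain' as the per-split average and expose 'total_gain' for the sum.
print(bst.get_score(importance_type='gain'))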
Nice question. The gain is calculated using this equation (it is the split-gain formula from the tutorial linked below):

Gain = 1/2 * [ G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - (G_L + G_R)^2 / (H_L + H_R + lambda) ] - gamma

where G_L and G_R are the sums of the first-order gradients of the loss over the instances in the left and right child, H_L and H_R are the corresponding sums of second-order gradients (hessians), lambda is the L2 regularization term, and gamma is the complexity cost of adding another leaf. Every time a split is made, this gain is recorded for the feature that was split on, and the 'gain' importance aggregates it over all splits that use the feature.
For a deep explanation read this: https://xgboost.readthedocs.io/en/latest/tutorials/model.html
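To make the formula concrete, here is a small sketch of the per-split gain computation; the gradient and hessian sums are made-up numbers for illustration, not values from a real model:

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Hypothetical gradient/hessian sums for the two children of a candidate split:
print(split_gain(G_L=-4.0, H_L=2.5, G_R=6.0, H_R=3.5))   # 6.0

The 'gain' feature importance then just aggregates these per-split values over every split in every tree that uses the feature, which is exactly the accumulation done in get_score() above.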