 

What does get_fscore() of an xgboost ML model do? [duplicate]

Does anybody know how the numbers are calculated? The documentation says this function will "Get feature importance of each feature", but there is no explanation of how to interpret the results.

asked Nov 11 '15 by Peter Lenaers

People also ask

What is F score in feature importance XGBoost?

The XGBoost library supports three methods for calculating feature importances: "weight" - the number of times a feature is used to split the data across all trees (also called the F score elsewhere in the docs); "gain" - the average gain of the feature when it is used in trees; "cover" - the average number of observations affected by the splits that use the feature.
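
For concreteness, here is a minimal sketch (the toy data and training parameters are made up for illustration) that requests each importance type from a trained Booster via get_score():

import numpy as np
import xgboost as xgb

# toy classification data, purely illustrative
X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

bst = xgb.train({"objective": "binary:logistic"}, xgb.DMatrix(X, label=y),
                num_boost_round=10)

# "weight" gives the same split counts as get_fscore(); "gain" and "cover"
# instead average the loss reduction / number of observations per split
for imp_type in ("weight", "gain", "cover"):
    print(imp_type, bst.get_score(importance_type=imp_type))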

How is feature importance calculated in XGBoost?

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.

What is Plot_importance XGBoost?

The XGBoost library provides a built-in function to plot features ordered by their importance. The function is called plot_importance() and can be used as follows:

from xgboost import plot_importance

# plot feature importance
plot_importance(model)
plt.show()
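
To make that snippet runnable end to end, a self-contained sketch (the toy data and XGBClassifier settings are assumptions for illustration, not from the original answer):

import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance

# toy classification data, purely illustrative
X = np.random.rand(300, 4)
y = (X[:, 0] > 0.5).astype(int)

model = xgb.XGBClassifier(n_estimators=20)
model.fit(X, y)

# bars are ordered by importance; by default plot_importance uses
# the "weight" (split count / F score) measure
plot_importance(model)
plt.show()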


1 Answer

This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version: https://cran.r-project.org/web/packages/xgboost/xgboost.pdf

It is about as basic a feature importance metric as you can get.

i.e. How many times was this variable split on?

The code for this method shows that it simply counts each occurrence of a given feature across all the trees.

Here is the relevant code: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953

def get_fscore(self, fmap=''):
    """Get feature importance of each feature.

    Parameters
    ----------
    fmap: str (optional)
        The name of feature map file
    """
    trees = self.get_dump(fmap)           # dump all the trees to text
    fmap = {}                             # feature name -> split count
    for tree in trees:                    # loop through the trees
        for line in tree.split('\n'):     # one node per line of the dump
            arr = line.split('[')
            if len(arr) == 1:             # leaf nodes have no '[...]' split condition
                continue
            fid = arr[1].split(']')[0]    # e.g. 'f2<1.5' from '0:[f2<1.5] yes=...'
            fid = fid.split('<')[0]       # keep only the feature name, e.g. 'f2'

            if fid not in fmap:           # first time this feature is seen
                fmap[fid] = 1
            else:
                fmap[fid] += 1            # otherwise increment its split count
    return fmap                           # counts of how many times each feature was split on
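
For illustration, a quick sketch (toy regression data; f0/f1/f2 are simply xgboost's default feature names when none are supplied) of what calling get_fscore() on a trained booster returns, namely a dict of raw split counts per feature:

import numpy as np
import xgboost as xgb

# toy regression data, purely illustrative
X = np.random.rand(500, 3)
y = X[:, 0] + 0.1 * X[:, 1]

bst = xgb.train({"objective": "reg:squarederror"}, xgb.DMatrix(X, label=y),
                num_boost_round=20)

# something like {'f0': 55, 'f1': 20, 'f2': 12}: raw, unnormalized split counts
print(bst.get_fscore())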
answered Nov 15 '22 by T. Scharf