Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How is the feature score(/importance) in the XGBoost package calculated?

The command xgb.importance returns a graph of feature importance measured by an f score.

What does this f score represent and how is it calculated?

Output: Graph of feature importance Graph of feature importance

like image 562
ishido Avatar asked Dec 11 '15 07:12

ishido


People also ask

How is feature importance calculated XGBoost?

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.

How is feature importance calculated?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.

What is F-score in feature importance XGBoost?

The XGBoost library supports three methods for calculating feature importances: "weight" - the number of times a feature is used to split the data across all trees. (also called f-score elsewhere in the docs) "gain" - the average gain of the feature when it is used in trees.

Is XGBoost feature important?

The XGBoost library provides a built-in function to plot features ordered by their importance. features are automatically named according to their index in feature importance graph.


1 Answers

This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version.https://cran.r-project.org/web/packages/xgboost/xgboost.pdf

It is about as basic a feature importance metric as you can get.

i.e. How many times was this variable split on?

The code for this method shows it is simply adding of the presence of a given feature in all the trees.

[here..https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953][1]

def get_fscore(self, fmap=''):     """Get feature importance of each feature.     Parameters     ----------     fmap: str (optional)        The name of feature map file     """     trees = self.get_dump(fmap)  ## dump all the trees to text     fmap = {}                         for tree in trees:              ## loop through the trees         for line in tree.split('\n'):     # text processing             arr = line.split('[')             if len(arr) == 1:             # text processing                  continue             fid = arr[1].split(']')[0]    # text processing             fid = fid.split('<')[0]       # split on the greater/less(find variable name)              if fid not in fmap:  # if the feature id hasn't been seen yet                 fmap[fid] = 1    # add it             else:                 fmap[fid] += 1   # else increment it     return fmap                  # return the fmap, which has the counts of each time a  variable was split on 
like image 74
T. Scharf Avatar answered Oct 03 '22 17:10

T. Scharf