The command xgb.importance
returns a graph of feature importance measured by an f score.
What does this f score represent and how is it calculated?
Output: Graph of feature importance
Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
The XGBoost library supports three methods for calculating feature importances: "weight" - the number of times a feature is used to split the data across all trees. (also called f-score elsewhere in the docs) "gain" - the average gain of the feature when it is used in trees.
The XGBoost library provides a built-in function to plot features ordered by their importance. features are automatically named according to their index in feature importance graph.
This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version.https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
It is about as basic a feature importance metric as you can get.
i.e. How many times was this variable split on?
The code for this method shows it is simply adding of the presence of a given feature in all the trees.
[here..https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953][1]
def get_fscore(self, fmap=''): """Get feature importance of each feature. Parameters ---------- fmap: str (optional) The name of feature map file """ trees = self.get_dump(fmap) ## dump all the trees to text fmap = {} for tree in trees: ## loop through the trees for line in tree.split('\n'): # text processing arr = line.split('[') if len(arr) == 1: # text processing continue fid = arr[1].split(']')[0] # text processing fid = fid.split('<')[0] # split on the greater/less(find variable name) if fid not in fmap: # if the feature id hasn't been seen yet fmap[fid] = 1 # add it else: fmap[fid] += 1 # else increment it return fmap # return the fmap, which has the counts of each time a variable was split on
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With