Does anybody how the numbers are calculated? In the documentation it says that this function "Get feature importance of each feature", but there is no explanation on how to interpret the results.
The XGBoost library supports three methods for calculating feature importances: "weight" - the number of times a feature is used to split the data across all trees. (also called f-score elsewhere in the docs) "gain" - the average gain of the feature when it is used in trees.
Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.
The XGBoost library provides a built-in function to plot features ordered by their importance. The function is called plot_importance() and can be used as follows: from xgboost import plot_importance # plot feature importance plot_importance(model) plt.show()
This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version.https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
It is about as basic a feature importance metric as you can get.
i.e. How many times was this variable split on?
The code for this method shows it is simply adding of the presence of a given feature in all the trees.
[here..https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953][1]
def get_fscore(self, fmap=''):
"""Get feature importance of each feature.
Parameters
----------
fmap: str (optional)
The name of feature map file
"""
trees = self.get_dump(fmap) ## dump all the trees to text
fmap = {}
for tree in trees: ## loop through the trees
for line in tree.split('\n'): # text processing
arr = line.split('[')
if len(arr) == 1: # text processing
continue
fid = arr[1].split(']')[0] # text processing
fid = fid.split('<')[0] # split on the greater/less(find variable name)
if fid not in fmap: # if the feature id hasn't been seen yet
fmap[fid] = 1 # add it
else:
fmap[fid] += 1 # else increment it
return fmap # return the fmap, which has the counts of each time a variable was split on
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With