
Feature Importance with XGBClassifier

Hopefully I'm reading this wrong, but the XGBoost documentation mentions extracting feature importances through the feature_importances_ attribute, much like sklearn's random forest.

However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'

My code snippet is below:

from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
mask = Y < 2 # arbitrarily removing class 2 so the target is only 0 and 1
X, Y = X[mask], Y[mask] # filter rows and labels with the same mask so they stay aligned
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_

It seems that you can compute feature importance from the Booster object by calling its get_fscore method. The only reason I'm using XGBClassifier over Booster is that it can be wrapped in a sklearn pipeline. Any thoughts on extracting the feature importances? Is anyone else experiencing this?

Minh Mai asked Jul 05 '16 21:07



2 Answers

As the comments indicate, I suspect your issue is a versioning one. However, if you don't want to or can't update, the following function should work for you.

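def get_xgb_imp(xgb, feat_names):
    # get_fscore() returns split counts keyed 'f0', 'f1', ...; unused features are absent.
    # On newer xgboost releases, call xgb.get_booster() instead of xgb.booster().
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f' + str(i), 0.)) for i in range(len(feat_names))}
    total = sum(imp_dict.values())  # works on both Python 2 and Python 3
    return {k: v / total for k, v in imp_dict.items()}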


>>> import numpy as np
>>> from xgboost import XGBClassifier
>>> 
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>> 
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}
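
The mapping and normalization step in the function above can be tried without xgboost at all. This is a minimal sketch, assuming a hand-written dictionary in the shape that get_fscore() returns. The split counts are hypothetical, chosen only so that the proportions match the output shown above, and normalize_fscore is an illustrative name, not part of xgboost.

```python
def normalize_fscore(fscore, feat_names):
    """Map 'f<i>' keys to readable names and scale the counts so they sum to 1."""
    # Features that were never used in a split are missing from the dict,
    # so fall back to 0.0 for them.
    counts = {name: float(fscore.get('f%d' % i, 0.0))
              for i, name in enumerate(feat_names)}
    total = sum(counts.values())
    if total == 0:
        # No feature was used in any split; avoid dividing by zero.
        return counts
    return {name: value / total for name, value in counts.items()}


# Hypothetical split counts; 'f4' is absent, as it would be for an unused feature.
fscore = {'f0': 17, 'f1': 11, 'f2': 11, 'f3': 10}
feat_names = ['var1', 'var2', 'var3', 'var4', 'var5']

importances = normalize_fscore(fscore, feat_names)
for name in feat_names:
    print(name, importances[name])
```

The zero-total guard is an addition for robustness: the original function would raise ZeroDivisionError if no feature were ever split on.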
David answered Sep 23 '22 14:09


With xgboost, once you have fitted the model with xgb.fit(), you can get the feature importances as follows.

import pandas as pd
from xgboost import plot_importance

xgb_model = xgb.fit(x, y)
# get_fscore() returns {feature: split count}; features never split on are omitted
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
print(xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')

plot_importance(xgb_model)  # bar chart of the same scores; requires matplotlib
rosefun answered Sep 23 '22 14:09