
feature_importances_ vector in decision trees in scikit-learn, along with feature names

I am running the decision tree algorithm from scikit-learn and I want to get the feature_importances_ vector along with the feature names, so I can determine which features are dominant in the labeling process. Could you help me? Thank you.

AlK asked Oct 20 '16

2 Answers

Suppose that you have samples as rows of a pandas.DataFrame:

from pandas import DataFrame
features = DataFrame({'f1': (1, 2, 2, 2), 'f2': (1, 1, 1, 1), 'f3': (3, 3, 1, 1)})
labels = ('a', 'a', 'b', 'b')

and then use a tree or a forest classifier:

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(features, labels)

Then the importances should match the frame columns:

for name, importance in zip(features.columns, classifier.feature_importances_):
    print(name, importance)

# f1 0.0
# f2 0.0
# f3 1.0
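The same zip() pattern carries over to ensemble models, since forests expose the same feature_importances_ attribute (averaged over their trees). A minimal sketch using RandomForestClassifier on the toy frame above (the n_estimators and random_state values here are arbitrary choices, not from the answer):

```python
from pandas import DataFrame
from sklearn.ensemble import RandomForestClassifier

features = DataFrame({'f1': (1, 2, 2, 2), 'f2': (1, 1, 1, 1), 'f3': (3, 3, 1, 1)})
labels = ('a', 'a', 'b', 'b')

# feature_importances_ on a forest is the average over all fitted trees
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(features, labels)

for name, importance in zip(features.columns, forest.feature_importances_):
    print(name, importance)
```

Because a forest subsamples rows and features, the individual values will differ from the single tree's, but they still line up positionally with features.columns.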
wrwrwr answered Sep 19 '22


A good suggestion by wrwrwr! Since the order of the values in the classifier's feature_importances_ attribute matches the order of the feature names in features.columns, you can pair them up with the zip() function.

Further, it is helpful to sort the features by importance and show only the top N.

Say you have created a classifier:

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

Then you can print the top 5 features in descending order of importance:

for importance, name in sorted(zip(clf.feature_importances_, X_train.columns), reverse=True)[:5]:
    print(name, importance)
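An equivalent one-liner (a sketch, not from the answer) wraps the importances in a pandas Series indexed by column name and uses nlargest; the toy X_train/y_train below stand in for whatever training data you actually have:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical training data standing in for X_train / y_train
X_train = pd.DataFrame({'f1': (1, 2, 2, 2), 'f2': (1, 1, 1, 1), 'f3': (3, 3, 1, 1)})
y_train = ('a', 'a', 'b', 'b')

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# index the importances by column name, then take the N largest
top = pd.Series(clf.feature_importances_, index=X_train.columns).nlargest(5)
print(top)
```

The Series keeps the name-to-importance pairing for you, so there is no separate zip/sort step.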
X Z answered Sep 19 '22