 

Plot Feature Importance with feature names

In R there are pre-built functions to plot the feature importance of a Random Forest model, but in Python such a method seems to be missing. I searched for a method in matplotlib.

model.feature_importances_ gives me the following:

array([  2.32421835e-03,   7.21472336e-04,   2.70491223e-03,
         3.34521084e-03,   4.19443238e-03,   1.50108737e-03,
         3.29160540e-03,   4.82320256e-01,   3.14117333e-03])

Then using following plotting function:

>>> pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
>>> pyplot.show()

I get a bar plot, but I would like a horizontal bar plot with feature labels, sorted by importance. I also explored seaborn but was not able to find a method.

asked Jun 13 '17 by add-semi-colons


People also ask

How do you identify feature importance in a decision tree?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
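
As a sketch of that formula: scikit-learn exposes the fitted tree's internals, so the impurity-based importances can be recomputed by hand. This is an illustration only, assuming a DecisionTreeClassifier and the bundled iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
total = t.weighted_n_node_samples[0]
imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, contributes nothing
        continue
    # impurity decrease at this split, weighted by the fraction of
    # samples that reach the node (= probability of reaching it)
    imp[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    ) / total
imp /= imp.sum()  # sklearn normalizes importances to sum to 1

print(np.allclose(imp, clf.feature_importances_))  # True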

How do you determine the importance of a feature?

The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
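
A minimal sketch of that idea, using scikit-learn's built-in permutation_importance helper (available since scikit-learn 0.22; iris data assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# shuffle each column n_repeats times; importance = mean drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print("feature %d: %.3f +/- %.3f"
          % (i, result.importances_mean[i], result.importances_std[i]))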

How is feature importance calculated in Xgboost?

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.
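
A hedged sketch, assuming the xgboost package is installed: importance_type='gain' reports the per-split improvement described above, while the default 'weight' merely counts how often a feature is split on.

import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

# total gain contributed by each feature's split points
print(model.get_booster().get_score(importance_type='gain'))
xgb.plot_importance(model, importance_type='gain')  # matplotlib bar chart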


2 Answers

Quick answer for data scientists that ain't got no time to waste:

Load the feature importances into a pandas series indexed by your column names, then use its plot method. For a classifier model trained using X:

import pandas as pd

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')

Slightly more detailed answer with a full example:

Assuming you trained your model with data contained in a pandas DataFrame, this is fairly painless: load the feature importances into a pandas Series, and you can leverage its index to display the variable names easily. The plot argument kind='barh' gives us a horizontal bar chart, but you could easily substitute kind='bar' for a traditional bar chart with the feature names along the x-axis if you prefer.

nlargest(n) is a pandas Series method which will return a subset of the series with the largest n values. This is useful if you've got lots of features in your model and you only want to plot the most important.
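
One caveat worth noting: nlargest returns values in descending order, and barh draws the first row at the bottom, so the biggest bar ends up at the bottom of the chart. Chaining sort_values() flips it:

# optional: re-sort ascending so the most important feature sits on top
feat_importances.nlargest(20).sort_values().plot(kind='barh')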

A quick complete example using the classic Kaggle Titanic dataset...

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# don't forget this line if you're using jupyter!
%matplotlib inline

X = pd.read_csv("titanic_train.csv")
X = X[['Pclass', 'Age', 'Fare', 'Parch', 'SibSp', 'Survived']].dropna()
y = X.pop('Survived')

model = RandomForestClassifier()
model.fit(X, y)

(pd.Series(model.feature_importances_, index=X.columns)
   .nlargest(4)
   .plot(kind='barh'))        # some method chaining, because it's sexy!

Which will give you this:

[plot: sklearn random forest feature importances]

answered Sep 19 '22 by fordy


I'm not exactly sure what you are looking for. I derived an example from here. As mentioned in the comment: you can change indices to a list of labels at the line plt.yticks(range(X.shape[1]), indices) if you want to customize the feature labels.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importances[indices],
       color="r", xerr=std[indices], align="center")
# If you want to define your own labels,
# change indices to a list of labels on the following line.
plt.yticks(range(X.shape[1]), indices)
plt.ylim([-1, X.shape[1]])
plt.show()
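
For example, a hypothetical feature_names list (not in the original answer) could replace the bare indices on that yticks line:

# hypothetical label list -- substitute your real feature names
feature_names = ["feature %d" % i for i in range(X.shape[1])]
plt.yticks(range(X.shape[1]), [feature_names[i] for i in indices])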


answered Sep 21 '22 by Y. Luo