In R there are pre-built functions to plot the feature importance of a Random Forest model, but in Python such a method seems to be missing. I searched for a method in matplotlib.
model.feature_importances_
gives me the following:
array([2.32421835e-03, 7.21472336e-04, 2.70491223e-03,
       3.34521084e-03, 4.19443238e-03, 1.50108737e-03,
       3.29160540e-03, 4.82320256e-01, 3.14117333e-03])
Then, using the following plotting calls:
>>> pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
>>> pyplot.show()
I get a bar plot, but I would like a horizontal bar plot with feature labels, sorted by importance. I also explored seaborn
and was not able to find a method.
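For reference, a labeled, sorted horizontal bar chart can be built with plain matplotlib. A minimal sketch using the importances from the question (the feature names here are made up, since the question doesn't list them):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# the importances printed in the question, with hypothetical feature names
importances = np.array([2.32421835e-03, 7.21472336e-04, 2.70491223e-03,
                        3.34521084e-03, 4.19443238e-03, 1.50108737e-03,
                        3.29160540e-03, 4.82320256e-01, 3.14117333e-03])
names = [f"feature_{i}" for i in range(len(importances))]

order = np.argsort(importances)  # ascending, so the largest bar ends up on top
plt.barh(range(len(order)), importances[order])
plt.yticks(range(len(order)), [names[i] for i in order])
plt.xlabel("importance")
plt.tight_layout()
plt.show()
```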
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
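This permutation approach is available in scikit-learn as sklearn.inspection.permutation_importance (added in version 0.22). A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# toy data: 6 features, only 3 of which are informative
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# shuffle each column n_repeats times and record the drop in score
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```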
Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.
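That weighted impurity decrease can be reproduced by hand from a fitted tree's tree_ attributes. The following sketch (synthetic data, my own variable names) recovers scikit-learn's feature_importances_ for a single decision tree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
w = t.weighted_n_node_samples
imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no contribution
        continue
    # impurity decrease at this split, weighted by samples reaching each node
    gain = (w[node] * t.impurity[node]
            - w[left] * t.impurity[left]
            - w[right] * t.impurity[right])
    imp[t.feature[node]] += gain

imp /= imp.sum()  # normalise so the importances sum to 1
print(np.allclose(imp, clf.feature_importances_))
```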
Quick answer for data scientists that ain't got no time to waste:
Load the feature importances into a pandas Series indexed by your column names, then use its plot method. For a classifier model trained using X:
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
Slightly more detailed answer with a full example:
Assuming you trained your model with data contained in a pandas DataFrame, this is fairly painless if you load the feature importances into a pandas Series; you can then leverage its indexing to display the variable names easily. The plot argument kind='barh'
gives us a horizontal bar chart, but you could easily substitute kind='bar'
for a traditional bar chart with the feature names along the x-axis if you prefer.
nlargest(n)
is a pandas Series method which returns the subset of the series with the n largest values. This is useful if you have lots of features in your model and only want to plot the most important ones.
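A tiny illustration of nlargest, with made-up importances:

```python
import pandas as pd

# hypothetical importances keyed by feature name
imp = pd.Series({"Fare": 0.31, "Age": 0.28, "Pclass": 0.22,
                 "SibSp": 0.10, "Parch": 0.09})
top2 = imp.nlargest(2)  # the two largest values, largest first
print(top2.index.tolist())  # ['Fare', 'Age']
```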
A quick complete example using the classic Kaggle Titanic dataset...
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# don't forget this line if you're using jupyter!
%matplotlib inline

X = pd.read_csv("titanic_train.csv")
X = X[['Pclass', 'Age', 'Fare', 'Parch', 'SibSp', 'Survived']].dropna()
y = X.pop('Survived')

model = RandomForestClassifier()
model.fit(X, y)

(pd.Series(model.feature_importances_, index=X.columns)
   .nlargest(4)
   .plot(kind='barh'))  # some method chaining, because it's sexy!
Which will give you this:
Not exactly sure what you are looking for. I derived an example from here. As mentioned in the comment: you can change indices
to a list of labels in the line plt.yticks(range(X.shape[1]), indices)
if you want to customize the feature labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importances[indices],
         color="r", xerr=std[indices], align="center")
# If you want to define your own labels,
# change indices to a list of labels on the following line.
plt.yticks(range(X.shape[1]), indices)
plt.ylim([-1, X.shape[1]])
plt.show()