I have a dataset that follows the one-hot encoding pattern, and my dependent variable is also binary. The first part of my code lists the important variables for the entire dataset. I used the method described in the Stack Overflow post "Using scikit to determine contributions of each feature to a specific class prediction", but I am unsure how to read the output I am getting. The overall feature importance ranks "Delay Related DMS With Advice" as the most important feature in my model. I interpret this to mean that the variable should be important in either Class 0 or Class 1, yet in the per-class output it appears unimportant in both classes. The post linked above also shows that when the DV is binary, the output for Class 0 is the exact opposite (in terms of sign +/-) of the output for Class 1. In my case, the values differ between the two classes.
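As I understand it, the value that method reports for class c and feature j is the mean standardized value of feature j over the samples of class c, multiplied by the forest's global importance of feature j. A minimal sketch of that reading (names follow my code below):
import numpy as np
from sklearn.preprocessing import scale
# my reading of the linked method, for one class c and one feature index j
def per_class_value(X, y, importances, c, j):
    Xs = scale(X)  # standardize each column to zero mean, unit variance
    return np.mean(Xs[y == c, j]) * importances[j]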
Here is what the plots look like:
[Plot: Feature Importance - Overall Model]
[Plot: Feature Importance - Class 0]
[Plot: Feature Importance - Class 1]
The second part of my code computes cumulative feature importances, but the resulting plot suggests that none of the variables are important. Is my formula wrong, is my interpretation wrong, or both?
[Plot: Cumulative Importances]
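For what it is worth, since forest.feature_importances_ sums to 1.0, I would expect the cumulative curve to climb to 1.0. A toy sanity check of that expectation:
import numpy as np
# four features, importances already sorted from most to least important
sorted_imp = [0.4, 0.3, 0.2, 0.1]
print(np.cumsum(sorted_imp))  # [0.4 0.7 0.9 1. ]; the 0.95 line is crossed at the 4th feature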
Here is my code:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale
from sklearn.ensemble import ExtraTreesClassifier
file = r'RCM_Binary.csv'
data = pd.read_csv(file)
print("data loaded successfully ...")
# Define features and target
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
#split to training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)
# define classifier and fitting data
forest = ExtraTreesClassifier(random_state=1)
forest.fit(X_train, y_train)
# predict and get confusion matrix
y_pred = forest.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
#Applying 10-fold cross validation
accuracies = cross_val_score(estimator=forest, X=X_train, y=y_train, cv=10)
print("accuracy (10-fold): ", np.mean(accuracies))
# Features importances
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
feature_list = [X.columns[indices[f]] for f in range(X.shape[1])] #names of features.
ff = np.array(feature_list)
# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f) name: %s" % (f + 1, indices[f], importances[indices[f]], ff[indices[f]]))
# Plot the feature importances of the forest
plt.figure()
plt.rcParams['figure.figsize'] = [16, 6]
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), ff[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()
## The new additions to get feature importance to classes:
# To get the importance according to each class:
def class_feature_importance(X, Y, feature_importances):
    N, M = X.shape
    X = scale(X)
    out = {}
    for c in set(Y):
        out[c] = dict(
            zip(range(N), np.mean(X[Y==c, :], axis=0)*feature_importances)
        )
    return out
result = class_feature_importance(X, y, importances)
print(json.dumps(result, indent=4))
# Plot the feature importances of the forest
titles = ["Did not Divert", "Diverted"]
for t, i in zip(titles, range(len(result))):
    plt.figure()
    plt.rcParams['figure.figsize'] = [16, 6]
    plt.title(t)
    plt.bar(range(len(result[i])), list(result[i].values()),
            color="r", align="center")
    plt.xticks(range(len(result[i])), ff[list(result[i].keys())], rotation=90)
    plt.xlim([-1, len(result[i])])
    plt.show()
Here is the second part of the code:
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical', color = 'r', edgecolor = 'k', linewidth = 1.2)
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
# List of features sorted from most to least important
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# Cumulative importances
cumulative_importances = np.cumsum(sorted_importances)
# Make a line graph
plt.plot(x_values, cumulative_importances, 'g-')
# Draw line at 95% of importance retained
plt.hlines(y = 0.95, xmin=0, xmax=len(sorted_importances), color = 'r', linestyles = 'dashed')
# Format x ticks and labels
plt.xticks(x_values, sorted_features, rotation = 'vertical')
# Axis labels and title
plt.xlabel('Variable'); plt.ylabel('Cumulative Importance'); plt.title('Cumulative Importances');
plt.show()
# Find number of features for cumulative importance of 95%
# Add 1 because Python is zero-indexed
print('Number of features for 95% importance:', np.where(cumulative_importances > 0.95)[0][0] + 1)
The question might be outdated, but just in case anyone is interested: the class_feature_importance function you copied from your source uses rows as features and columns as samples, while you, like most people, do it the other way round. That is why the calculation of feature importances per class goes awry: the output dict is keyed by range(N) (sample indices) instead of range(M) (feature indices). Changing zip(range(N), ...) to zip(range(M), ...) should solve it.
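For reference, here is the corrected function with that one change applied (a sketch; apart from range(M), everything follows the question's code):
import numpy as np
from sklearn.preprocessing import scale
def class_feature_importance(X, Y, feature_importances):
    N, M = X.shape  # N samples, M features
    X = scale(X)    # standardize each feature to zero mean, unit variance
    out = {}
    for c in set(Y):
        # mean standardized value of each feature within class c,
        # weighted by the forest's global importance of that feature
        out[c] = dict(
            zip(range(M), np.mean(X[Y == c, :], axis=0) * feature_importances)
        )
    return out
Two more observations on the question itself. First, scale(X) gives every column zero mean, so the class means satisfy n0*m0 + n1*m1 = 0, where n0 and n1 are the class sizes; hence m0 = -(n1/n0)*m1. The per-class values therefore always have opposite signs, but they are exact mirror images only when the classes are balanced, which would explain why your values differ between the two classes. Second, in the second part of your code, round(importance, 2) zeroes out every importance below 0.005; with many one-hot columns most importances are that small, so a cumulative sum of the rounded values can stay near zero. Building the cumulative sum from the unrounded importances should give a curve that climbs to 1.0.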