 

Feature importance in a binary classification and extracting SHAP values for one of the classes only

Suppose we have a binary classification problem with two classes, 1 and 0, as our target. I aim to use a tree classifier to predict 1s and 0s given the features, and I can use SHAP values to rank the importance of the features that are predictive of 1s and 0s. So far everything is good!

Now suppose that I want to know the importance of features that are predictive of 1s only; what is the recommended approach? I could split my data into two parts (nominally: df_tot = df_zeros + df_ones), fit my classifier on df_ones, and then extract the SHAP values for it, but then the target would only contain 1s, so the model would not really learn to classify anything. How does one approach such a problem?

asked Dec 02 '20 by Wiliam




1 Answer

Let's prepare some binary classification data:

from seaborn import load_dataset
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
import numpy as np
import shap

titanic = load_dataset("titanic")
# drop the target and the columns that leak or duplicate it
X = titanic.drop(["survived", "alive", "adult_male", "who", "deck"], axis=1)
y = titanic["survived"]

features = X.columns
cat_features = []
for cat in X.select_dtypes(exclude="number"):
    cat_features.append(cat)
    # think about meaningful ordering instead
    X[cat] = X[cat].astype("category").cat.codes.astype("category")

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=.8, random_state=42)

clf = LGBMClassifier(max_depth=3, n_estimators=1000, objective="binary")
# LightGBM >= 4 expects early stopping and logging as callbacks
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)],
        callbacks=[early_stopping(100), log_evaluation(100)])

To answer your question: to extract SHAP values on a per-class basis, one may subset them by predicted class label:

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# sv has shape (n_classes, n_samples, n_features)
sv = np.array(shap_values)
# predicted labels, used to split samples by class
y = clf.predict(X_train).astype("bool")
# shap values for predicted survivors
sv_survive = sv[:,y,:]
# shap values for predicted non-survivors
sv_die = sv[:,~y,:]
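
A quick sanity check on the shapes (a minimal sketch; the exact counts depend on the train split) makes the subsetting explicit:

print(sv.shape)          # (2, n_train_samples, n_features)
print(sv_survive.shape)  # (2, n_predicted_survivors, n_features)
print(sv_die.shape)      # (2, n_predicted_non_survivors, n_features)
assert sv_survive.shape[1] + sv_die.shape[1] == sv.shape[1]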

However, a more interesting question is what you can do with these values.

In general, one can gain valuable insights by looking at summary_plot (for the whole dataset):

shap.summary_plot(shap_values[1], X_train.astype("float"))

[SHAP summary plot for the survival class]

Interpretation (globally):

  • sex, pclass and age were the most influential features in determining the outcome
  • being male, less affluent, and older decreased the chances of survival

The top 3 globally most influential features can be extracted as follows:

# rank features by mean absolute SHAP value for the survival class
idx = np.abs(sv[1,:,:]).mean(0).argsort()
features[idx[:-4:-1]]
# Index(['sex', 'pclass', 'age'], dtype='object')

If you want to analyze on a per-class basis, you may do this separately for survivors (sv[1,y,:]):

# top3 features for probability of survival
idx = sv[1,y,:].mean(0).argsort()
features[idx[:-4:-1]]
# Index(['sex', 'pclass', 'age'], dtype='object')

The same for those who did not survive (sv[0,~y,:]):

# top3 features for probability of dying
idx = sv[0,~y,:].mean(0).argsort()
features[idx[:3]]
# Index(['alone', 'embark_town', 'parch'], dtype='object')

Note that we are using mean SHAP values here: we take the biggest values for survivors and the lowest values for those who did not survive (values close to 0 may also mean a feature has no consistent, one-directional influence at all). Using the mean of absolute values may also make sense, but the interpretation then becomes "most influential, regardless of direction".
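
For concreteness, here is a minimal sketch (reusing sv and features from above) contrasting the two rankings for the survival class over the whole training set:

# signed means: direction-aware, positive pushes toward survival
signed_order = features[sv[1].mean(0).argsort()[::-1]]
# absolute means: magnitude only, regardless of direction
abs_order = features[np.abs(sv[1]).mean(0).argsort()[::-1]]
print(list(signed_order[:3]))
print(list(abs_order[:3]))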

To make an educated choice between plain means and means of absolute values, one has to be aware of the following facts:

  • shap values could be both positive and negative
  • shap values are symmetrical, and increasing/decreasing probability of one class decreases/increases probability of the other by the same amount (due to p₁ = 1 - p₀)

Proof:

# shap values, shape (n_classes, n_samples, n_features)
sv = np.array(shap_values)
# base (expected) values, one per class
ev = np.array(explainer.expected_value)
# both classes' shap values for the first sample
sv_died, sv_survived = sv[:,0,:]  # + constant (base values)
print(sv_died, sv_survived, sep="\n")
# [-0.73585563  1.24520748  0.70440429 -0.15443337 -0.01855845 -0.08430467  0.02916375 -0.04846619  0.         -0.01035171]
# [ 0.73585563 -1.24520748 -0.70440429  0.15443337  0.01855845  0.08430467 -0.02916375  0.04846619  0.          0.01035171]

Most probably you'll find that sex and age played the most influential role both for survivors and for those who died; hence, rather than analyzing the most influential features per class, it would be more interesting to see what made two passengers of the same sex and age have different outcomes (hint: find such cases in the dataset, feed one as the background, and analyze the SHAP values for the other; or try analyzing one class with the other as background).
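
A minimal sketch of that hint follows; the filter values below (sex code 1, age 28) are hypothetical and merely assume such a matched pair with different outcomes exists in the data:

# find passengers sharing sex and age (hypothetical example values)
pair = X_train[(X_train["sex"] == 1) & (X_train["age"] == 28.0)]
background = pair.iloc[[0]]  # reference passenger
target = pair.iloc[[1]]      # passenger to explain against that reference

# with an explicit background, attributions explain the difference
# between the target's prediction and the background's expectation
contrast_explainer = shap.TreeExplainer(
    clf, data=background, feature_perturbation="interventional"
)
contrast_sv = contrast_explainer.shap_values(target)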

You may do further analysis with dependence_plot (on a global or per-class basis):

shap.dependence_plot("sex", shap_values[1], X_train)

[SHAP dependence plot for "sex" on the survival class]

Interpretation (globally):

  • males had a lower probability of survival (lower SHAP values)
  • pclass (affluence) was the next most influential factor: a higher pclass (less affluence) decreased the chance of survival for females, and vice versa for males
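
For the per-class variant mentioned above, a minimal sketch (restricting both the SHAP values and the feature rows to predicted survivors, using y from earlier) could be:

# dependence plot for "sex" on predicted survivors only
shap.dependence_plot("sex", shap_values[1][y], X_train[y])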
answered Sep 27 '22 by Sergey Bushmanov