
Shap statistics

Tags: python, shap

I used shap to determine the feature importance for multiple regression with correlated features.

import numpy as np
import pandas as pd  
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import shap


boston = load_boston()
regr = pd.DataFrame(boston.data)
regr.columns = boston.feature_names
regr['MEDV'] = boston.target

X = regr.drop('MEDV', axis = 1)
Y = regr['MEDV']

fit = LinearRegression().fit(X, Y)

explainer = shap.LinearExplainer(fit, X, feature_dependence = 'independent')
# I used 'independent' because the result is consistent with the ordinary
# Shapley values, whereas 'correlated' is not

shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, plot_type = 'bar')

[Image: bar chart produced by shap.summary_plot]

shap offers a chart of the feature importances. Is there also a way to get the underlying statistic? I am interested in the exact shap values behind the plot. I read the GitHub repository and the documentation, but I found nothing on this topic.

Banjo asked Oct 15 '22 12:10


1 Answer

Looking at shap_values, we see that it contains both positive and negative numbers, and that its shape matches that of the feature matrix X (one row per sample, one column per feature). Linear regression is an ML algorithm that fits y = wx + b, where y is MEDV, x is the feature vector, and w is the vector of weights. As far as I can tell, shap_values stores the per-feature terms of wx, centered on the feature means: each entry is a weight calculated by the linear regression multiplied by how far the feature value lies from its average, w_i * (x_i - mean(x_i)). That is why the signs vary.
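To make the link between the weights and the per-feature contributions concrete, here is a minimal sketch that uses only numpy. It assumes the features are treated as independent, in which case (to my understanding) the contribution of feature i is w_i * (x_i - mean(x_i)). The data, the names X_demo, y_demo and contrib, and the least-squares fit are illustrative stand-ins, not the Boston data or the shap library itself.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Ordinary least squares: a column of ones is appended to estimate the intercept b
A = np.column_stack([X_demo, np.ones(len(X_demo))])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w, b = coef[:-1], coef[-1]

# Per-sample, per-feature contribution under the independence assumption
contrib = (X_demo - X_demo.mean(axis=0)) * w

# Baseline: the average prediction over the data
pred = X_demo @ w + b
base_value = pred.mean()

# Additivity check: baseline plus the row sum of contributions gives the prediction
print(np.allclose(base_value + contrib.sum(axis=1), pred))
```

Each column of contrib averages to zero by construction, which is one reason a plain mean over the samples is not a useful importance score.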

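So, to compute the statistic you want, I first took the absolute values and then averaged them over the samples. The order is important: averaging first would let positive and negative values cancel out. Next, I attached the original column names and sorted by effect size (ascending, so that the largest effect ends up at the top of the horizontal bar chart). With this, I hope I have answered your question! :)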

from matplotlib import pyplot as plt


#retaining only the size of the effect
shap_values_abs = np.absolute(shap_values)

#optional rescaling for readability; drop the division to keep the exact mean |shap| values
means_norm = shap_values_abs.mean(axis = 0)/1e-15

#sorting values and names
idx = np.argsort(means_norm)
means = np.array(means_norm)[idx]
names = np.array(boston.feature_names)[idx]

#plotting
plt.figure(figsize=(10,10))
plt.barh(names, means)

Mean(Abs(shap_values)) plot
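If you want the numbers rather than the plot, the same statistic can be returned as a sorted list. The sketch below is one possible way to do it. The helper name mean_abs_importance and the small demo matrix are hypothetical and only stand in for shap_values and the feature names.

```python
import numpy as np

def mean_abs_importance(values, names):
    """Return (name, mean |value|) pairs sorted from largest to smallest."""
    values = np.asarray(values, dtype=float)
    scores = np.abs(values).mean(axis=0)  # absolute value first, then average
    order = np.argsort(scores)[::-1]      # descending order of importance
    return [(names[i], float(scores[i])) for i in order]

# Hypothetical 4-sample, 3-feature matrix standing in for shap_values
demo_values = np.array([[ 1.0, -0.2,  0.0],
                        [-1.0,  0.4,  0.1],
                        [ 2.0, -0.4, -0.1],
                        [-2.0,  0.2,  0.0]])
demo_names = ["A", "B", "C"]

ranking = mean_abs_importance(demo_values, demo_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")

# Averaging before taking the absolute value lets the signs cancel
wrong_order = np.abs(demo_values.mean(axis=0))
print(wrong_order)
```

With the variables from the question, a call along the lines of mean_abs_importance(shap_values, list(X.columns)) should give the per-feature values behind the bar chart, without any rescaling.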

gregoruar answered Oct 21 '22 06:10