 

How to extract the actual values from the shap summary plot

How do I extract the numerical values behind the SHAP summary plot so that the data can be viewed in a DataFrame?

[Image: SHAP summary bar plot]

Here is a MWE:

import numpy as np
from sklearn.datasets import make_classification
from shap import Explainer, summary_plot
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate noisy data
X, y = make_classification(n_samples=1000, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)

explainer = Explainer(model)
shap_values = explainer(X_test)

summary_plot(shap_values, X_test, plot_type="bar")

I tried

np.abs(shap_values.values).mean(axis=0)

but I get an array of shape (50, 10). How do I get a single aggregated value per feature, so I can sort by feature importance?

Asked Jan 20 '26 by Joe

1 Answer

You've done this:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from shap import Explainer, summary_plot
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


# Generate noisy data
X, y = make_classification(
    n_samples=1000,
    n_features=50,
    n_informative=9,
    n_redundant=0,
    n_repeated=0,
    n_classes=10,
    n_clusters_per_class=1,
    class_sep=9,
    flip_y=0.2,
    random_state=17,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)

explainer = Explainer(model)
sv = explainer.shap_values(X_test)

summary_plot(sv, X_test, plot_type="bar")

[Image: SHAP summary bar plot produced by the code above]

Note, you have features 3, 29, 34 and so on at the top.

If you do:

np.abs(sv).shape

(10, 250, 50)

you'll find you have 10 classes × 250 data points × 50 features.

If you aggregate, you'll get everything you need:

aggs = np.abs(sv).mean(1)
aggs.shape

(10, 50)
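To collapse this further into one importance score per feature (the ordering the bar plot uses), you can sum the per-class aggregates across classes. A minimal sketch, using random numbers as a stand-in for the real SHAP array of shape (10, 250, 50):

```python
import numpy as np

rng = np.random.default_rng(0)
sv = rng.normal(size=(10, 250, 50))  # stand-in for the real SHAP values

# mean |SHAP| per class and feature -> shape (10, 50)
aggs = np.abs(sv).mean(1)

# sum across classes -> one score per feature, shape (50,)
feature_importance = aggs.sum(0)

# indices of the top-10 features, most important first
top10 = np.argsort(feature_importance)[::-1][:10]
print(top10)
```

Sorting by `feature_importance` reproduces the feature ranking shown along the y-axis of the bar plot.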

You can draw it:

sv_df = pd.DataFrame(aggs.T)
sv_df.plot(kind="barh", stacked=True)

[Image: stacked horizontal bar plot of aggregated SHAP values, all 50 features]

And if it still doesn't look familiar, you can rearrange and filter:

sv_df.loc[sv_df.sum(1).sort_values(ascending=True).index[-10:]].plot(
    kind="barh", stacked=True
)

[Image: stacked horizontal bar plot of the top-10 features]

Conclusion:

sv_df holds the aggregated SHAP values, as in the summary plot, arranged with one feature per row and one class per column.
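To answer the original question directly, the same frame can be reduced to a sorted, per-feature importance Series. A sketch with random stand-in data and hypothetical "Feature i" labels in place of real feature names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sv = rng.normal(size=(10, 250, 50))  # stand-in for the real SHAP values
aggs = np.abs(sv).mean(1)            # (10, 50), as in the answer

# features as rows, classes as columns, mirroring sv_df above
sv_df = pd.DataFrame(aggs.T, index=[f"Feature {i}" for i in range(50)])

# overall importance = row sum across classes, sorted descending
importance = sv_df.sum(axis=1).sort_values(ascending=False)
print(importance.head(10))
```

`importance` is exactly the quantity the bar plot ranks by, now available as a DataFrame/Series for inspection.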

Does it help?

Answered Jan 23 '26 by Sergey Bushmanov


