The NLTK package provides a method show_most_informative_features() that lists the most informative features for each class, with output like:
contains(outstanding) = True    pos : neg  =  11.1 : 1.0
contains(seagal) = True         neg : pos  =   7.7 : 1.0
contains(wonderfully) = True    pos : neg  =   6.8 : 1.0
contains(damon) = True          pos : neg  =   5.9 : 1.0
contains(wasted) = True         neg : pos  =   5.8 : 1.0
As answered in the question "How to get most informative features for scikit-learn classifiers?", this can also be done in scikit-learn. However, for a binary classifier, the answer there only outputs the top features themselves, without indicating which class each feature belongs to.
So my question is: how can I identify each feature's associated class, as in the example above (outstanding is most informative for the pos class, and seagal for the neg class)?
EDIT: What I actually want is a list of the most informative words for each class. How can I do that? Thanks!
You can get the same kind of output from a scikit-learn linear classifier, with one class's features on the left and the other's on the right. Here is an example from a binary problem with the classes Irrelevant and Relevant:
             precision    recall  f1-score   support

 Irrelevant       0.77      0.98      0.86       129
   Relevant       0.78      0.15      0.25        46

avg / total       0.77      0.77      0.70       175
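A report like the one above comes from scikit-learn's classification_report; here is a minimal sketch with made-up label arrays (y_true and y_pred are hypothetical):

from sklearn.metrics import classification_report

# Hypothetical gold and predicted labels for a binary problem
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0]
print(classification_report(y_true, y_pred, target_names=["Irrelevant", "Relevant"]))

As for the features themselves: in a binary linear model, negative coefficients push the decision toward the first class (clf.classes_[0]) and positive coefficients toward the second, so the left column below presumably lists the strongest features for Irrelevant and the right column the strongest for Relevant: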
-1.3914 davis 1.4809 austin
-1.1023 suicide 1.0695 march
-1.0609 arrested 1.0379 call
-1.0145 miller 1.0152 tsa
-0.8902 packers 0.9848 passengers
-0.8370 train 0.9547 pensacola
-0.7557 trevor 0.7432 bag
-0.7457 near 0.7056 conditt
-0.7359 military 0.7002 midamerica
-0.7302 berlin 0.6987 mark
-0.6880 april 0.6799 grenade
-0.6581 plane 0.6357 suspicious
-0.6351 disposal 0.6348 death
-0.5804 wwii 0.6053 flight
-0.5723 terminal 0.5745 marabi
That two-column output is produced by the helper below (note that get_feature_names() was removed in scikit-learn 1.2 in favor of get_feature_names_out(), so the code tries both):

def show_most_informative_features(vectorizer, clf, n=20):
    # Feature names line up with the columns of clf.coef_
    try:
        feature_names = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
    except AttributeError:
        feature_names = vectorizer.get_feature_names()      # older scikit-learn
    # Sort (coefficient, feature) pairs from most negative to most positive
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative (first class) with the n most positive (second class)
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))