
How to get most informative features for scikit-learn classifiers?

The classifiers in machine learning packages like liblinear and nltk offer a show_most_informative_features() method, which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0
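For reference, output like the above comes from NLTK's naive Bayes classifier; a minimal sketch, with made-up training pairs standing in for real labeled data:

import nltk

# Hypothetical labeled feature dicts: (features, label) pairs
train_set = [
    ({"viagra": True, "casino": True}, "spam"),
    ({"hello": True}, "ok"),
    # ... more labeled examples
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)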

My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does somebody know a workaround for getting those values?

asked Jun 20 '12 by tobigue



2 Answers

The classifiers themselves do not record feature names, they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes) then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints the features with the highest coefficient values, per class."""
    # scikit-learn >= 1.0; on older versions use vectorizer.get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        # Indices of the 10 largest coefficients for this class
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
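Usage might look like this (a sketch on a made-up three-class toy corpus; the documents and labels are placeholders). Passing clf.classes_ takes care of the label ordering, since it is already sorted to match the rows of clf.coef_:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with three classes
docs = ["cheap viagra now", "hello dear friend", "meeting at noon",
        "win money at the casino", "hello again friend", "reschedule the meeting"]
labels = ["spam", "ok", "work", "spam", "ok", "work"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# clf.classes_ matches the row order of clf.coef_
print_top10(vectorizer, clf, clf.classes_)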

answered by Fred Foo


With the help of larsmans' code, I came up with this code for the binary case:

def show_most_informative_features(vectorizer, clf, n=20):
    # scikit-learn >= 1.0; on older versions use vectorizer.get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    # Pair each coefficient with its feature name, sorted from most negative
    # (strongest for the first class) to most positive (strongest for the second)
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
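A usage sketch for the binary case, with toy data made up for illustration; the left column shows the features pulling hardest toward the first class, the right column toward the second:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical two-class toy corpus
docs = ["cheap viagra now", "hello dear friend",
        "win money at the casino", "hello again friend"]
labels = ["spam", "ok", "spam", "ok"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# With classes_ == ['ok', 'spam']: negative coefficients indicate "ok",
# positive coefficients indicate "spam"
show_most_informative_features(vectorizer, clf, n=5)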
answered by tobigue