
How to get most informative features for scikit-learn classifiers?

The classifiers in machine learning packages like liblinear and nltk offer a show_most_informative_features() method, which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0
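For reference, output like the above comes from NLTK's naive Bayes classifier; a minimal sketch, with made-up training pairs standing in for real labeled data:

import nltk

# Hypothetical labeled feature dicts: (features, label) pairs
train_set = [
    ({"viagra": True, "casino": True}, "spam"),
    ({"hello": True}, "ok"),
    # ... more labeled examples
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)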

My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does somebody know a workaround for getting those values?

asked Jun 20 '12 by tobigue



2 Answers

The classifiers themselves do not record feature names, they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes) then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints the features with the highest coefficient values, per class."""
    # scikit-learn >= 1.0; on older versions use vectorizer.get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        # Indices of the 10 largest coefficients for this class
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
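Usage might look like this (a sketch on a made-up three-class toy corpus; the documents and labels are placeholders). Passing clf.classes_ takes care of the label ordering, since it is already sorted to match the rows of clf.coef_:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with three classes
docs = ["cheap viagra now", "hello dear friend", "meeting at noon",
        "win money at the casino", "hello again friend", "reschedule the meeting"]
labels = ["spam", "ok", "work", "spam", "ok", "work"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# clf.classes_ matches the row order of clf.coef_
print_top10(vectorizer, clf, clf.classes_)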

answered by Fred Foo


With the help of larsmans' code, I came up with this code for the binary case:

def show_most_informative_features(vectorizer, clf, n=20):
    # scikit-learn >= 1.0; on older versions use vectorizer.get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    # Pair each coefficient with its feature name, sorted from most negative
    # (strongest for the first class) to most positive (strongest for the second)
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
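A usage sketch for the binary case, with toy data made up for illustration; the left column shows the features pulling hardest toward the first class, the right column toward the second:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical two-class toy corpus
docs = ["cheap viagra now", "hello dear friend",
        "win money at the casino", "hello again friend"]
labels = ["spam", "ok", "spam", "ok"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# With classes_ == ['ok', 'spam']: negative coefficients indicate "ok",
# positive coefficients indicate "spam"
show_most_informative_features(vectorizer, clf, n=5)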
answered by tobigue