The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:
viagra = None          ok : spam   =   4.5 : 1.0
hello = True           ok : spam   =   4.5 : 1.0
hello = None           spam : ok   =   3.3 : 1.0
viagra = True          spam : ok   =   3.3 : 1.0
casino = True          spam : ok   =   2.0 : 1.0
casino = None          ok : spam   =   1.5 : 1.0
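(For reference, output in that format comes from NLTK's NaiveBayesClassifier. A minimal sketch, with made-up training feature dicts, looks roughly like this:)

import nltk

# Toy (feature dict, label) pairs -- invented data, just to show the call
train = [
    ({'viagra': True, 'casino': True}, 'spam'),
    ({'viagra': True}, 'spam'),
    ({'hello': True}, 'ok'),
    ({'hello': True, 'casino': True}, 'ok'),
]

clf = nltk.NaiveBayesClassifier.train(train)
clf.show_most_informative_features(10)  # prints ratio lines like the ones above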
My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation, but couldn't find anything of the kind.
If there is no such function yet, does somebody know a workaround to get at those values?
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))
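A usage sketch, under assumed toy data (the corpus, labels, and model choice below are invented for illustration; also note that on scikit-learn 1.0+ the vectorizer method is get_feature_names_out() rather than get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Made-up three-class corpus so clf.coef_ has one row per class
docs = ["cheap casino bonus win", "hello dear friend", "meeting agenda minutes",
        "win at the casino", "hello how are you", "agenda for next meeting"]
labels = ["spam", "personal", "work", "spam", "personal", "work"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# clf.classes_ is sorted and matches the row order of clf.coef_
print_top10(vectorizer, clf, clf.classes_)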
This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
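To make the binary case concrete: for two classes, coef_ has shape (1, n_features), and positive coefficients push the decision toward clf.classes_[1]. A small sketch with made-up data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up binary corpus
docs = ["cheap casino bonus", "hello dear friend", "win casino money", "hello how are you"]
labels = ["spam", "ok", "spam", "ok"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

print(clf.classes_)     # ['ok' 'spam']; positive values in coef_[0] favour 'spam'
print(clf.coef_.shape)  # (1, n_features)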
With the help of larsmans' code, I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
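One possible way to call it, assuming a fitted binary text classifier (the data and model below are invented for illustration; again, newer scikit-learn versions need get_feature_names_out() instead of get_feature_names()):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up binary spam/ok corpus
docs = ["cheap viagra casino win", "hello dear friend", "casino bonus win money",
        "hello how are you", "free casino viagra", "see you at lunch"]
labels = ["spam", "ok", "spam", "ok", "spam", "ok"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# Left column: features pushing toward clf.classes_[0] ('ok'),
# right column: features pushing toward clf.classes_[1] ('spam')
show_most_informative_features(vectorizer, clf, n=5)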