Problems obtaining most informative features with scikit learn?

I'm trying to obtain the most informative features from a textual corpus. From this well-answered question I know that this task can be done as follows:

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

Then:

most_informative_feature_for_class(tfidf_vect, clf, 5)

For this classifier:

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values


from sklearn import cross_validation
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
                                                    y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

The problem is the output of most_informative_feature_for_class:

5 a_base_de_bien bastante   (0, 2451)   -0.210683496368
  (0, 3533) -0.173621065386
  (0, 8034) -0.135543062425
  (0, 10346)    -0.173621065386
  (0, 15231)    -0.154148294738
  (0, 18261)    -0.158890483047
  (0, 21083)    -0.297476572586
  (0, 434)  -0.0596263855375
  (0, 446)  -0.0753492277856
  (0, 769)  -0.0753492277856
  (0, 1118) -0.0753492277856
  (0, 1439) -0.0753492277856
  (0, 1605) -0.0753492277856
  (0, 1755) -0.0637950312345
  (0, 3504) -0.0753492277856
  (0, 3511) -0.115802483001
  (0, 4382) -0.0668983049212
  (0, 5247) -0.315713152154
  (0, 5396) -0.0753492277856
  (0, 5753) -0.0716096348446
  (0, 6507) -0.130661516772
  (0, 7978) -0.0753492277856
  (0, 8296) -0.144739048504
  (0, 8740) -0.0753492277856
  (0, 8906) -0.0753492277856
  : :
  (0, 23282)    0.418623443832
  (0, 4100) 0.385906085143
  (0, 15735)    0.207958503155
  (0, 16620)    0.385906085143
  (0, 19974)    0.0936828782325
  (0, 20304)    0.385906085143
  (0, 21721)    0.385906085143
  (0, 22308)    0.301270427482
  (0, 14903)    0.314164150621
  (0, 16904)    0.0653764031957
  (0, 20805)    0.0597723455204
  (0, 21878)    0.403750815828
  (0, 22582)    0.0226150073272
  (0, 6532) 0.525138162099
  (0, 6670) 0.525138162099
  (0, 10341)    0.525138162099
  (0, 13627)    0.278332617058
  (0, 1600) 0.326774799211
  (0, 2074) 0.310556919237
  (0, 5262) 0.176400451433
  (0, 6373) 0.290124806858
  (0, 8593) 0.290124806858
  (0, 12002)    0.282832270298
  (0, 15008)    0.290124806858
  (0, 19207)    0.326774799211

It is not returning the labels or the words. Why is this happening, and how can I print the words and the labels? Could this be happening because I am using pandas to read the data? Another thing I tried is the following, from this question:

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))


print_top10(tfidf_vect,clf,y)

But I get this traceback:

Traceback (most recent call last):

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
    print_top10(tfidf_vect,clf,5)
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
    for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable

Any idea how to solve this, in order to get the features with the highest coefficient values?

asked May 03 '15 by john doe


1 Answer

To solve this specifically for a linear SVM, we first have to understand how the SVM is formulated in sklearn and how it differs from MultinomialNB.

The reason why most_informative_feature_for_class works for MultinomialNB is that its coef_ is essentially the log probability of features given a class (and hence has shape [n_classes, n_features]), due to the formulation of the naive Bayes problem. But if we check the documentation for SVM, coef_ is not that simple. Instead, coef_ for a (linear) SVM has shape [n_classes * (n_classes - 1) / 2, n_features], because a separate binary model is fitted for every possible pair of classes (one-vs-one).
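
As a quick sanity check (a minimal sketch on a toy four-class corpus, not the data from the question; the coef_ attribute on MultinomialNB reflects the older scikit-learn API used throughout this post), you can compare the two shapes directly:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy four-class corpus, just to inspect the coef_ shapes.
docs = ["uno dos tres", "one two three", "eins zwei drei", "un deux trois",
        "dos tres cuatro", "two three four", "zwei drei vier", "deux trois quatre"]
labels = ["es", "en", "de", "fr", "es", "en", "de", "fr"]

X = CountVectorizer().fit_transform(docs)

mnb = MultinomialNB().fit(X, labels)
svc = SVC(kernel='linear', C=1).fit(X, labels)

print(mnb.coef_.shape)  # (4, n_features): one row of log-probabilities per class
print(svc.coef_.shape)  # (6, n_features): 4 * (4 - 1) / 2 one-vs-one binary models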

If we already know which particular coefficient row we're interested in, we can alter the function to look like the following:

def most_informative_feature_for_class_svm(vectorizer, classifier,  classlabel, n=10):
    labelid = ?? # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef

This will work as intended and print out the top n features and their coefficients for the coefficient row you're after.

As for getting the correct output for a particular class, that depends on your assumptions and what you aim to output. I suggest reading through the multi-class section of the SVM documentation to get a feel for what you're after.
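
If it helps, here is one way to see which pair of classes each coefficient row corresponds to. This is a sketch that assumes the one-vs-one ordering documented for SVC's decision_function, and the helper name ovo_coef_index is my own:

from itertools import combinations

def ovo_coef_index(classifier):
    # Map each row index of a one-vs-one SVC's coef_ to its (class_a, class_b) pair.
    # Rows follow the documented order: (0 vs 1), (0 vs 2), ..., (1 vs 2), (1 vs 3), ...
    pairs = list(combinations(classifier.classes_, 2))
    return dict(enumerate(pairs))

# For the four tags used below, classes_ is sorted to ['bs', 'es', 'pt', 'sr'],
# so row 3 would correspond to the ('es', 'pt') model.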

So, using the train.txt file described in this question, we can get some kind of output, though in this situation it isn't particularly descriptive or easy to interpret. Hopefully this helps you.

import codecs, re, time
from itertools import chain

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'

# Vectorizing data.
train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = ['bs','pt','es','sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

from sklearn.svm import SVC
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

def most_informative_feature_for_class_svm(vectorizer, classifier,  n=10):
    labelid = 3 # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
print 
most_informative_feature_for_class_svm(word_vectorizer, svcc)

with output:

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

no 0.0204081632653
parecer 0.0204081632653
pone 0.0204081632653
por 0.0204081632653
relación 0.0204081632653
una 0.0204081632653
visto 0.0204081632653
ya 0.0204081632653
es 0.0408163265306
lo 0.0408163265306
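
Side note, not part of the original setup above: if per-class rather than per-pair weights are easier to interpret, LinearSVC fits one-vs-rest models, so its multi-class coef_ has one row per class and the original MultinomialNB-style lookup applies almost unchanged. A rough sketch, reusing trainset, tags and word_vectorizer from the snippet above (the _ovr function name is my own):

from sklearn.svm import LinearSVC

lsvc = LinearSVC(C=1)
lsvc.fit(trainset, tags)
print(lsvc.coef_.shape)  # (4, n_features): one one-vs-rest row per class

def most_informative_feature_for_class_ovr(vectorizer, classifier, classlabel, n=10):
    # Same idea as the MultinomialNB version: look the class up in classes_.
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print("%s %s %s" % (classlabel, feat, coef))

most_informative_feature_for_class_ovr(word_vectorizer, lsvc, 'pt')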

answered Nov 07 '22 by chappers