Store most informative features from NLTK NaiveBayesClassifier in a list

I am trying this Naive Bayes classifier in Python:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set) * 100))
classifier.show_most_informative_features(5)

I get the following output:

[Image: console output]

The output clearly shows which words appear more often in the "important" category and which in the "spam" category, but I can't work with these values programmatically. What I actually want is a list that looks like this:

[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]

I am new to Python and am having a hard time figuring this out. Can anyone help? I would be very thankful.

Romy Gomes asked Mar 23 '17 08:03

2 Answers

You could slightly modify the source code of show_most_informative_features to suit your purpose.

In the returned list, the first element of each sub-list is the feature name and the second is its label (more specifically, the label associated with the numerator term of the likelihood ratio, i.e. the label under which that feature value is most probable).

Helper function:

def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier, each paired with its predominant label.
    """
    cpdist = classifier._feature_probdist       # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()], 
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list

Testing this on a classifier trained over the positive/negative movie review corpus of nltk:

show_most_informative_features_in_list(classifier, 10)

produces:

[['outstanding', 'pos'],
 ['ludicrous', 'neg'],
 ['avoids', 'pos'],
 ['astounding', 'pos'],
 ['idiotic', 'neg'],
 ['atrocious', 'neg'],
 ['offbeat', 'pos'],
 ['fascination', 'pos'],
 ['symbol', 'pos'],
 ['animators', 'pos']]
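
To make "the label associated with the numerator" concrete, here is a minimal pure-Python sketch of the selection step, written without NLTK. The probabilities and the name `predominant_label` are hypothetical and chosen only for illustration; in the helper function the same quantity comes from `cpdist[l, fname].prob(fval)`. The sketch assumes every probability is nonzero, which should hold for smoothed estimates.

```python
def predominant_label(prob_by_label):
    """Return (label, ratio) for one feature value.

    prob_by_label maps each label to P(feature value | label).
    The returned label has the highest probability; ratio is the
    highest probability divided by the lowest (assumed nonzero).
    """
    # Sort labels by ascending probability, as the helper does.
    ranked = sorted(prob_by_label, key=prob_by_label.get)
    lowest, highest = ranked[0], ranked[-1]
    return highest, prob_by_label[highest] / prob_by_label[lowest]


# Hypothetical probabilities for a single feature value, not corpus numbers.
label, ratio = predominant_label({"pos": 0.25, "neg": 0.5})
print(label, ratio)
```

Returning the ratio alongside the label is optional; it is included here because it is the number that `show_most_informative_features` prints next to each feature.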
Nickil Maveli answered Nov 06 '22 01:11


Simply use most_informative_features().

Using the example from "Classification using movie review corpus in NLTK/Python":

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = list(word_features.keys())[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
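
Each training instance built above is a `(feature_dict, label)` pair, where the dictionary maps every word in `word_features` to `True` or `False`. The following is a minimal sketch of that step on made-up data; the words, the tokens, and the function name `document_features` are placeholders rather than values from the corpus. Converting the tokens to a set first is an optional tweak that makes each membership test constant time.

```python
def document_features(tokens, word_features):
    """Map each candidate word to whether it occurs in the document."""
    token_set = set(tokens)  # set lookup avoids rescanning the token list
    return {word: (word in token_set) for word in word_features}


# Hypothetical vocabulary and document, for illustration only.
word_features = ["nice", "internet", "clean"]
tokens = ["a", "nice", "clean", "film"]

# One training instance: (feature dictionary, class label).
instance = (document_features(tokens, word_features), "pos")
print(instance)
```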

Then, simply:

print(classifier.most_informative_features())

[out]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True)]

And to list all features:

classifier.most_informative_features(n=len(word_features))

[out]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True),
 ('commercially', True),
 ('utilize', True),
 ('gratuitous', True),
 ('visible', True),
 ('internet', True),
 ('disillusioned', True),
 ('boost', True),
 ('preventing', True),
 ('built', True),
 ('repairs', True),
 ('overplaying', True),
 ('election', True),
 ('caterer', True),
 ('decks', True),
 ('retiring', True),
 ('pivot', True),
 ('outwitting', True),
 ('solace', True),
 ('benches', True),
 ('terrorizes', True),
 ('billboard', True),
 ('catalogue', True),
 ('clean', True),
 ('skits', True),
 ('nice', True),
 ('feature', True),
 ('must', True),
 ('withdrawn', True),
 ('indulgence', True),
 ('tribal', True),
 ('freeman', True),
 ('must', False),
 ('nice', False),
 ('feature', False),
 ('gratuitous', False),
 ('turturro', False),
 ('built', False),
 ('internet', False),
 ('rescued', False),
 ('clean', False),
 ('overacts', False),
 ('gregor', False),
 ('conflicted', False),
 ('taboo', False),
 ('inhabiting', False),
 ('utilize', False),
 ('churns', False),
 ('boost', False),
 ('stepdaughter', False),
 ('complementary', False),
 ('gleiberman', False),
 ('skylar', False),
 ('kirkpatrick', False),
 ('hardship', False),
 ('election', False),
 ('inform', False),
 ('disillusioned', False),
 ('visible', False),
 ('commercially', False),
 ('frosted', False),
 ('pup', False),
 ('apologizing', False),
 ('freeman', False),
 ('preventing', False),
 ('nutsy', False),
 ('intrinsics', False),
 ('somalia', False),
 ('coordinators', False),
 ('strengthening', False),
 ('impatience', False),
 ('subtely', False),
 ('426', False),
 ('schreber', False),
 ('brimley', False),
 ('motherload', False),
 ('creepily', False),
 ('perturbed', False),
 ('accountants', False),
 ('beringer', False),
 ('scrubs', False),
 ('1830s', False),
 ('analogue', False),
 ('espouses', False),
 ('xv', False),
 ('skits', False),
 ('solace', False),
 ('reduncancy', False),
 ('parenthood', False),
 ('insulators', False),
 ('mccoll', False)]

To clarify:

>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool
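
Because each entry is a plain `(feature_name, feature_value)` tuple, ordinary list operations apply. As a minimal sketch, the snippet below takes a few tuples from the output shown earlier in this answer and keeps only the names of features whose value is `True`. Note that the second element is the feature value, not a class label; `present_feature_names` is a name made up for this example.

```python
def present_feature_names(informative_features):
    """Return names of features whose value is True, preserving order."""
    return [name for name, value in informative_features if value is True]


# A few tuples copied from the output shown earlier in this answer.
sample = [("turturro", True), ("inhabiting", True), ("taboo", True), ("must", False)]

print(present_feature_names(sample))

# Nested lists instead of tuples, if that shape is preferred.
print([list(pair) for pair in sample])
```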

To clarify further: if the feature values in the feature set are strings rather than booleans, most_informative_features() returns those strings as the second element of each tuple, e.g.

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = list(word_features.keys())[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:'positive' if (i in tokens) else 'negative'  for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)

And:

>>> classifier.most_informative_features(10)
[('turturro', 'positive'),
 ('inhabiting', 'positive'),
 ('conflicted', 'positive'),
 ('taboo', 'positive'),
 ('overacts', 'positive'),
 ('rescued', 'positive'),
 ('stepdaughter', 'positive'),
 ('pup', 'positive'),
 ('apologizing', 'positive'),
 ('inform', 'positive')]

>>> type(classifier.most_informative_features(10)[0][1])
str
alvas answered Nov 06 '22 02:11