I am trying this Naive Bayes classifier in Python:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print "Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set)*100)
classifier.show_most_informative_features(5)
I have the following output:
[screenshot of console output]
It is clearly visible which words appear more in the "important" category and which in the "spam" category, but I can't work with these values. I actually want a list that looks like this:
[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]
I am new to Python and having a hard time figuring all of this out. Can anyone help? I would be very thankful.
You could slightly modify the source code of show_most_informative_features to suit your purpose.
The first element of each sub-list corresponds to the most informative feature name, while the second element corresponds to its label (more specifically, the label associated with the numerator term of the ratio).
Helper function:

def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier, along with their predominant labels.
    """
    cpdist = classifier._feature_probdist  # probability distribution of feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)

        # Sort the labels that have seen this feature value by how probable
        # the value is under each label; the last one is the predominant label.
        labels = sorted([l for l in classifier._labels
                         if fval in cpdist[l, fname].samples()],
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list
Testing this on a classifier trained over the positive/negative movie review corpus of NLTK:
show_most_informative_features_in_list(classifier, 10)
produces:
[['outstanding', 'pos'],
['ludicrous', 'neg'],
['avoids', 'pos'],
['astounding', 'pos'],
['idiotic', 'neg'],
['atrocious', 'neg'],
['offbeat', 'pos'],
['fascination', 'pos'],
['symbol', 'pos'],
['animators', 'pos']]
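If you would rather have the words grouped by their predominant label than as flat [word, label] pairs, a small post-processing step over the returned list is enough; by_label is just an illustrative name here:

from collections import defaultdict

# Group the most informative words under their predominant label.
by_label = defaultdict(list)
for word, label in show_most_informative_features_in_list(classifier, 10):
    by_label[label].append(word)

print(by_label['pos'])  # e.g. ['outstanding', 'avoids', 'astounding', ...]
print(by_label['neg'])  # e.g. ['ludicrous', 'idiotic', 'atrocious']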
Simply use most_informative_features().
Using the examples from Classification using movie review corpus in NLTK/Python:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')

# Each document is (list of non-stopword, non-punctuation tokens, label);
# the label ('pos' or 'neg') comes from the fileid prefix.
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Keep the first 100 vocabulary items as the feature set.
word_features = FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]

# 90/10 train/test split; each feature is a boolean "word occurs in document".
numtrain = int(len(documents) * 90 / 100)
train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[numtrain:]]

classifier = nbc.train(train_set)
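As an aside, the accuracy line from the question works unchanged on this split, since nltk.classify.accuracy takes any trained classifier and a labeled test set:

import nltk

# Score the trained classifier on the held-out 10% of documents.
print("Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set) * 100))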
Then, simply:
print(classifier.most_informative_features())
[out]:
[('turturro', True),
('inhabiting', True),
('taboo', True),
('conflicted', True),
('overacts', True),
('rescued', True),
('stepdaughter', True),
('apologizing', True),
('pup', True),
('inform', True)]
And to list all features:
classifier.most_informative_features(n=len(word_features))
[out]:
[('turturro', True),
('inhabiting', True),
('taboo', True),
('conflicted', True),
('overacts', True),
('rescued', True),
('stepdaughter', True),
('apologizing', True),
('pup', True),
('inform', True),
('commercially', True),
('utilize', True),
('gratuitous', True),
('visible', True),
('internet', True),
('disillusioned', True),
('boost', True),
('preventing', True),
('built', True),
('repairs', True),
('overplaying', True),
('election', True),
('caterer', True),
('decks', True),
('retiring', True),
('pivot', True),
('outwitting', True),
('solace', True),
('benches', True),
('terrorizes', True),
('billboard', True),
('catalogue', True),
('clean', True),
('skits', True),
('nice', True),
('feature', True),
('must', True),
('withdrawn', True),
('indulgence', True),
('tribal', True),
('freeman', True),
('must', False),
('nice', False),
('feature', False),
('gratuitous', False),
('turturro', False),
('built', False),
('internet', False),
('rescued', False),
('clean', False),
('overacts', False),
('gregor', False),
('conflicted', False),
('taboo', False),
('inhabiting', False),
('utilize', False),
('churns', False),
('boost', False),
('stepdaughter', False),
('complementary', False),
('gleiberman', False),
('skylar', False),
('kirkpatrick', False),
('hardship', False),
('election', False),
('inform', False),
('disillusioned', False),
('visible', False),
('commercially', False),
('frosted', False),
('pup', False),
('apologizing', False),
('freeman', False),
('preventing', False),
('nutsy', False),
('intrinsics', False),
('somalia', False),
('coordinators', False),
('strengthening', False),
('impatience', False),
('subtely', False),
('426', False),
('schreber', False),
('brimley', False),
('motherload', False),
('creepily', False),
('perturbed', False),
('accountants', False),
('beringer', False),
('scrubs', False),
('1830s', False),
('analogue', False),
('espouses', False),
('xv', False),
('skits', False),
('solace', False),
('reduncancy', False),
('parenthood', False),
('insulators', False),
('mccoll', False)]
To clarify:
>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool
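Note that the second element of each tuple is the feature value (hence the bool), not a class label. If you want the [word, label] pairs the question asks for, you can combine most_informative_features() with the classifier's per-label feature distributions, in the same spirit as the helper in the first answer; this sketch relies on the same private _feature_probdist and _labels attributes:

# For each informative (feature, value) pair, pick the label under which
# that value is most probable (NLTK-internal attributes, may change).
pairs = []
for fname, fval in classifier.most_informative_features(10):
    label = max(classifier._labels,
                key=lambda l: classifier._feature_probdist[l, fname].prob(fval))
    pairs.append([fname, label])
print(pairs)  # e.g. [['turturro', 'pos'], ['inhabiting', 'pos'], ...]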
For further clarification, if the values used in the feature set are strings, most_informative_features() will return a string as the feature value, e.g.
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')

documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]

numtrain = int(len(documents) * 90 / 100)
# Feature values are now the strings 'positive'/'negative' instead of booleans.
train_set = [({i: 'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({i: 'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens, tag in documents[numtrain:]]

classifier = nbc.train(train_set)
And:
>>> classifier.most_informative_features(10)
[('turturro', 'positive'),
('inhabiting', 'positive'),
('conflicted', 'positive'),
('taboo', 'positive'),
('overacts', 'positive'),
('rescued', 'positive'),
('stepdaughter', 'positive'),
('pup', 'positive'),
('apologizing', 'positive'),
('inform', 'positive')]
>>> type(classifier.most_informative_features(10)[0][1])
str