I am looking for a simple example on how to run a Multinomial Naive Bayes Classifier. I came across this example from StackOverflow:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
from nltk.corpus import movie_reviews
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]
# Original code from thread:
# classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))
classif.train(add_label(pos, 'pos') + add_label(neg, 'neg'))  # Made changes here
# Original code from thread:
# l_pos = np.array(classif.batch_classify(pos[100:]))
# l_neg = np.array(classif.batch_classify(neg[100:]))
l_pos = np.array(classif.batch_classify(pos))  # Made changes here
l_neg = np.array(classif.batch_classify(neg))  # Made changes here
print "Confusion matrix:\n%d\t%d\n%d\t%d" % (
(l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
(l_neg == 'pos').sum(), (l_neg == 'neg').sum())
I received a warning after running this example.
C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_selection\univariate_selection.py:327:
UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features,
or you used a classification score for a regression task.
warn("Duplicate scores. Result may depend on feature ordering."
Confusion matrix:
876 124
63 937
So, my questions are: why do I get this warning, why are my confusion-matrix numbers different from the original thread's, and what does the confusion matrix actually tell me?
The original code trains on the first 100 examples of each class and then classifies the remainder. You removed that boundary and used every example in both the training and the classification phase; in other words, you evaluated the classifier on its own training data. To fix this, split the data set into two disjoint sets, train and test.
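That split can be sketched as follows. To keep it self-contained, this uses made-up toy word-count dicts in place of the movie_reviews FreqDists (so it runs without the corpus download), a bare MultinomialNB instead of the full tfidf/chi2 pipeline, and classify_many, the current name for batch_classify in newer NLTK versions:

```python
import numpy as np
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the FreqDist feature dicts built from movie_reviews
pos = [{'great': 2, 'fun': 1}, {'great': 1, 'loved': 1},
       {'fun': 2, 'loved': 1}, {'great': 1, 'fun': 1}]
neg = [{'awful': 2, 'boring': 1}, {'awful': 1, 'dull': 1},
       {'boring': 2, 'dull': 1}, {'awful': 1, 'boring': 1}]

add_label = lambda lst, lab: [(x, lab) for x in lst]

# Train on the first half of each class only...
classif = SklearnClassifier(MultinomialNB())
classif.train(add_label(pos[:2], 'pos') + add_label(neg[:2], 'neg'))

# ...and evaluate on the held-out second half.
l_pos = np.array(classif.classify_many(pos[2:]))
l_neg = np.array(classif.classify_many(neg[2:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
    (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
    (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
```

With the real corpus you would keep the pipeline from the question and simply train on pos[:100] + neg[:100] while classifying pos[100:] and neg[100:], as the original thread did.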
The confusion-matrix counts look better (and are different) because you are testing on the same data you trained on.
The confusion matrix is a measure of accuracy and shows the number of false positives etc. Read more here: http://en.wikipedia.org/wiki/Confusion_matrix
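In the layout your print statement produces, rows are the true labels and columns are the predicted labels, so overall accuracy is the diagonal sum divided by the total. A quick check with the numbers from your output:

```python
import numpy as np

# Rows: true pos, true neg. Columns: predicted pos, predicted neg.
cm = np.array([[876, 124],
               [63, 937]])

accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
print(accuracy)  # 0.9065
```

The off-diagonal entries (124 false negatives for pos, 63 false positives for pos) are the errors.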
I used the original code with only the first 100 entries for the training set and still had that warning. My output was:
In [6]: %run testclassifier.py
C:\Users\..\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\feature_selection\univariate_selection.py:319:
UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features,
or you used a classification score for a regression task.
warn("Duplicate scores. Result may depend on feature ordering."
Confusion matrix:
427 473
132 768