
Classifying Multinomial Naive Bayes Classifier with Python Example

I am looking for a simple example of how to run a Multinomial Naive Bayes Classifier. I came across this example from StackOverflow:

Implementing Bag-of-Words Naive-Bayes classifier in NLTK

import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

from nltk.corpus import movie_reviews
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]
#Original code from thread:
#classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))
classif.train(add_label(pos, 'pos') + add_label(neg, 'neg'))#Made changes here

#Original code from thread:    
#l_pos = np.array(classif.batch_classify(pos[100:]))
#l_neg = np.array(classif.batch_classify(neg[100:]))
l_pos = np.array(classif.batch_classify(pos))#Made changes here
l_neg = np.array(classif.batch_classify(neg))#Made changes here
print "Confusion matrix:\n%d\t%d\n%d\t%d" % (
          (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
          (l_neg == 'pos').sum(), (l_neg == 'neg').sum())
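For reference, the final evaluation step above is Python 2 syntax, and `batch_classify` was renamed `classify_many` in NLTK 3. A minimal Python 3 sketch of the same confusion-matrix print, with placeholder prediction arrays standing in for the `classify_many` output (the counts here are just the ones reported later in this question, used for illustration):

```python
import numpy as np

# Placeholder predictions standing in for:
#   l_pos = np.array(classif.classify_many(pos))
#   l_neg = np.array(classif.classify_many(neg))
l_pos = np.array(['pos'] * 876 + ['neg'] * 124)
l_neg = np.array(['pos'] * 63 + ['neg'] * 937)

# Same confusion-matrix layout as the original code, as a Python 3 print call.
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
    (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
    (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
```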

I received a warning after running this example.

C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_selection\univariate_selection.py:327: 
UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, 
or you used a classification score for a regression task.
warn("Duplicate scores. Result may depend on feature ordering."

Confusion matrix:
876 124
63  937

So, my questions are:

  1. Can anyone tell me what this warning message means?
  2. I made some changes to the original code, but why are the confusion matrix results so much higher than the ones in the original thread?
  3. How can I test the accuracy of this classifier?
asked Oct 04 '22 by Cryssie


2 Answers

The original code trains on the first 100 examples of each class and then classifies the remainder. You removed that boundary and used every example in both the training and the classification phase; in other words, you evaluated the classifier on its own training data. To fix this, split the data set into two disjoint sets, train and test.
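A minimal sketch of such a split, following the 100-per-class cutoff from the original thread. Here `pos` and `neg` are placeholder lists of feature dicts standing in for the `FreqDist` lists built from `movie_reviews` in the question:

```python
# Placeholder feature dicts; in the question these come from
# FreqDist(movie_reviews.words(i)) over the 1000 files per class.
pos = [{'word%d' % i: 1} for i in range(1000)]
neg = [{'word%d' % i: 2} for i in range(1000)]

# Train on the first 100 examples of each class...
train = [(x, 'pos') for x in pos[:100]] + [(x, 'neg') for x in neg[:100]]

# ...and hold out the remainder for evaluation only.
test_pos, test_neg = pos[100:], neg[100:]

print(len(train), len(test_pos), len(test_neg))  # 200 900 900
```

Because the test examples were never seen during training, the resulting confusion matrix reflects generalization rather than memorization.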

The confusion matrix numbers are higher (and different) because you are evaluating on data the classifier has already seen during training.

The confusion matrix is itself a measure of accuracy: it tabulates the counts of true positives, false negatives, false positives, and true negatives. Read more here: http://en.wikipedia.org/wiki/Confusion_matrix
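Overall accuracy can be computed directly from the four cells. Using the matrix printed in the question (rows are the true class, columns the predicted class):

```python
# Counts from the question's confusion matrix.
tp, fn = 876, 124   # true 'pos' classified as 'pos' / as 'neg'
fp, tn = 63, 937    # true 'neg' classified as 'pos' / as 'neg'

# Accuracy = correct predictions / all predictions.
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(accuracy)  # 0.9065
```

Note that this 90.65% is measured on the training data itself, so it overstates how well the classifier would do on unseen reviews.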

answered Oct 07 '22 by Spaceghost


I used the original code with only the first 100 entries for the training set and still had that warning. My output was:

In [6]: %run testclassifier.py
C:\Users\..\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\feature_selection\univariate_selecti
on.py:319: UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, o
r you used a classification score for a regression task.
  warn("Duplicate scores. Result may depend on feature ordering."
Confusion matrix:
427     473
132     768
answered Oct 07 '22 by frank t