
Implementing Bag-of-Words Naive-Bayes classifier in NLTK


I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier considers only whether a word occurs in a document as a feature; it doesn't consider the frequency of the words as the feature to look at ("bag-of-words").

One of the answers seems to suggest this can't be done with the built-in NLTK classifiers. Is that the case? How can I do frequency/bag-of-words NB classification with NLTK?

asked Apr 11 '12 by bgcode


People also ask

What is Bag of Words in naive Bayes?

The term "bag of words" [1] describes representing a document, in the context of Naive Bayes, as a bag: each vocabulary item in the text is an item in the bag, and multiple occurrences are permitted, so word counts are kept while word order is discarded.
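To make that concrete, here is a minimal sketch (the sentence is made up) of how NLTK's FreqDist reduces a text to such a bag, keeping counts but discarding order:

from nltk.probability import FreqDist

# A toy document; word order is discarded, but counts are kept.
words = "the movie was good the acting was good".split()
bag = FreqDist(words)
print(bag['good'])   # 2 -- multiple occurrences are counted
print(bag['movie'])  # 1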

Does NLTK use naive Bayes?

NLTK (Natural Language Toolkit) provides a Naive Bayes classifier for classifying text data.
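For reference, here is a minimal sketch of that built-in classifier in the style of the NLTK book, using two made-up training examples; note that the features are booleans (word present or absent), which is exactly the limitation the question above is about:

from nltk.classify import NaiveBayesClassifier

# Toy data, invented for illustration: each example is (feature dict, label).
# Book-style features record only the presence or absence of a word.
train = [({'contains(good)': True,  'contains(bad)': False}, 'pos'),
         ({'contains(good)': False, 'contains(bad)': True},  'neg')]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'contains(good)': True, 'contains(bad)': False}))  # 'pos'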

How do I use naive Bayes classifier in Python?

First approach (in the case of a single feature):
Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability of each attribute for each class.
Step 3: Put these values into Bayes' formula and calculate the posterior probability.
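A toy sketch of those three steps, with all counts and probabilities invented purely for illustration:

# Step 1: prior probabilities from made-up class counts (3 pos docs, 1 neg doc).
p_pos, p_neg = 3 / 4, 1 / 4

# Step 2: likelihood of observing the word 'good' in each class (invented numbers).
p_good_given_pos, p_good_given_neg = 0.6, 0.1

# Step 3: Bayes' formula -- the posterior is proportional to likelihood * prior.
unnorm_pos = p_good_given_pos * p_pos  # 0.45
unnorm_neg = p_good_given_neg * p_neg  # 0.025
p_pos_given_good = unnorm_pos / (unnorm_pos + unnorm_neg)
print(round(p_pos_given_good, 3))  # 0.947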

What is classification in NLTK?

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are: Deciding whether an email is spam or not.


1 Answer

scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though.

As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated one that does TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes that into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)

import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF weighting -> chi2 selection of the 1000 best features -> multinomial NB.
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# Bag-of-words features: one FreqDist (word -> count) per movie review.
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]

# Train on the first 100 reviews of each class.
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

# Classify the held-out reviews and tabulate a confusion matrix.
l_pos = np.array(classif.classify_many(pos[100:]))
l_neg = np.array(classif.classify_many(neg[100:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
          (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
          (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))

This printed for me:

Confusion matrix:
524     376
202     698

Not perfect, but decent, considering it's not a super easy problem and it's only trained on 100 examples per class.
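As for the SVM suggestion above, swapping the last pipeline stage is all it takes. A sketch, reusing pos, neg, and add_label from the snippet above (LinearSVC is one reasonable choice here, not the only one):

from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same TF-IDF weighting and chi2 selection, but a linear SVM at the end.
svm_classif = SklearnClassifier(Pipeline([
    ('tfidf', TfidfTransformer()),
    ('chi2', SelectKBest(chi2, k=1000)),
    ('svc', LinearSVC())]))
svm_classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))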

answered Sep 23 '22 by Danica