
Implementing Bag-of-Words Naive-Bayes classifier in NLTK


I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier considers only whether a word occurs in a document as a feature; it doesn't consider the frequency of the words as the feature to look at ("bag-of-words").

One of the answers seems to suggest this can't be done with the built-in NLTK classifiers. Is that the case? How can I do frequency/bag-of-words NB classification with NLTK?

asked Apr 11 '12 by bgcode


People also ask

What is Bag of Words in naive Bayes?

The term "bag of words" [1] describes representing a document, in the context of Naive Bayes, as a bag: each vocabulary item in the text is an item in the bag, and multiple occurrences are permitted, so word counts are kept while word order is discarded.
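To make that concrete, here is a minimal sketch (the sentence is made up) of how NLTK's FreqDist reduces a text to such a bag, keeping counts but discarding order:

from nltk.probability import FreqDist

# A toy document; word order is discarded, but counts are kept.
words = "the movie was good the acting was good".split()
bag = FreqDist(words)
print(bag['good'])   # 2 -- multiple occurrences are counted
print(bag['movie'])  # 1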

Does NLTK use naive Bayes?

NLTK (Natural Language Toolkit) provides a Naive Bayes classifier for classifying text data.
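For reference, here is a minimal sketch of that built-in classifier in the style of the NLTK book, using two made-up training examples; note that the features are booleans (word present or absent), which is exactly the limitation the question above is about:

from nltk.classify import NaiveBayesClassifier

# Toy data, invented for illustration: each example is (feature dict, label).
# Book-style features record only the presence or absence of a word.
train = [({'contains(good)': True,  'contains(bad)': False}, 'pos'),
         ({'contains(good)': False, 'contains(bad)': True},  'neg')]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'contains(good)': True, 'contains(bad)': False}))  # 'pos'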

How do I use naive Bayes classifier in Python?

First approach (in the case of a single feature):
Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability of each attribute for each class.
Step 3: Put these values into Bayes' formula and calculate the posterior probability.
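A toy sketch of those three steps, with all counts and probabilities invented purely for illustration:

# Step 1: prior probabilities from made-up class counts (3 pos docs, 1 neg doc).
p_pos, p_neg = 3 / 4, 1 / 4

# Step 2: likelihood of observing the word 'good' in each class (invented numbers).
p_good_given_pos, p_good_given_neg = 0.6, 0.1

# Step 3: Bayes' formula -- the posterior is proportional to likelihood * prior.
unnorm_pos = p_good_given_pos * p_pos  # 0.45
unnorm_neg = p_good_given_neg * p_neg  # 0.025
p_pos_given_good = unnorm_pos / (unnorm_pos + unnorm_neg)
print(round(p_pos_given_good, 3))  # 0.947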

What is classification in NLTK?

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are: Deciding whether an email is spam or not.


1 Answer

scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though.

As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated one that does TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes that into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)

import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF weighting -> chi2 selection of the 1000 best features -> multinomial NB.
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# Bag-of-words features: one FreqDist (word -> count) per movie review.
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]

# Train on the first 100 reviews of each class.
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

# Classify the held-out reviews and tabulate a confusion matrix.
l_pos = np.array(classif.classify_many(pos[100:]))
l_neg = np.array(classif.classify_many(neg[100:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
          (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
          (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))

This printed for me:

Confusion matrix:
524     376
202     698

Not perfect, but decent, considering it's not a super easy problem and it's only trained on 100 examples per class.
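As for the SVM suggestion above, swapping the last pipeline stage is all it takes. A sketch, reusing pos, neg, and add_label from the snippet above (LinearSVC is one reasonable choice here, not the only one):

from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same TF-IDF weighting and chi2 selection, but a linear SVM at the end.
svm_classif = SklearnClassifier(Pipeline([
    ('tfidf', TfidfTransformer()),
    ('chi2', SelectKBest(chi2, k=1000)),
    ('svc', LinearSVC())]))
svm_classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))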

answered Sep 23 '22 by Danica