I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier considers only whether a word occurs in a document as a feature; it doesn't consider the frequency of the words as features ("bag of words").
One of the answers seems to suggest this can't be done with the built-in NLTK classifiers. Is that the case? How can I do frequency/bag-of-words NB classification with NLTK?
The term "bag of words" [1] describes how a document is represented in a Naive Bayes setting: the document is treated as a bag, and each word in the text is an item in the bag, with multiple occurrences permitted. Under this representation, word frequency, not just word presence, is available as a feature.
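As a minimal sketch of the difference between the two representations (the toy document here is made up for illustration):

from collections import Counter

# A hypothetical toy document.
words = "the movie was good the acting was good".split()

# NLTK-book style: binary presence features (did the word occur at all?).
presence_features = {w: True for w in set(words)}
# {'the': True, 'movie': True, 'was': True, 'good': True, 'acting': True}

# Bag-of-words style: frequency features (how often did each word occur?).
frequency_features = Counter(words)
# Counter({'the': 2, 'was': 2, 'good': 2, 'movie': 1, 'acting': 1})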
NLTK (Natural Language Toolkit) provides a Naive Bayes classifier for classifying text data.
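For reference, here is a minimal sketch of the NLTK-book approach the question mentions: boolean presence features fed to the built-in NaiveBayesClassifier. The tiny training set is made up for illustration; the built-in classifier treats feature values as nominal, which is why booleans rather than counts are the usual choice here.

from nltk.classify import NaiveBayesClassifier

# Toy training set: each item is (feature dict, label).
train = [
    ({'good': True, 'great': True}, 'pos'),
    ({'awful': True, 'boring': True}, 'neg'),
    ({'great': True, 'fun': True}, 'pos'),
    ({'boring': True, 'bad': True}, 'neg'),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'good': True, 'fun': True}))  # likely 'pos'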
First approach (for a single feature):
Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability of each attribute for each class.
Step 3: Put these values into Bayes' formula and calculate the posterior probability.
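As a rough illustration of those three steps with made-up numbers: suppose 60% of training documents are 'pos' (the prior), and the word 'good' appears in 30% of 'pos' documents but only 5% of 'neg' documents (the likelihoods). A minimal sketch of the posterior calculation:

# Hypothetical values, for illustration only.
prior_pos, prior_neg = 0.6, 0.4    # Step 1: priors P(class)
like_pos, like_neg = 0.30, 0.05    # Step 2: likelihoods P('good' | class)

# Step 3: Bayes' rule. P(class | 'good') is proportional to
# P('good' | class) * P(class), normalized over both classes.
evidence = like_pos * prior_pos + like_neg * prior_neg
posterior_pos = like_pos * prior_pos / evidence
print(posterior_pos)  # 0.18 / 0.20 = 0.9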
Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. A classic example of a classification task is deciding whether an email is spam or not.
scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though.
As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated one that does TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes that into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)
import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF weighting -> top 1000 features by chi2 -> multinomial naive Bayes.
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

from nltk.corpus import movie_reviews

# Each document becomes a FreqDist, i.e. a bag-of-words frequency dict.
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]

# Train on the first 100 documents of each class; test on the rest.
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

l_pos = np.array(classif.classify_many(pos[100:]))
l_neg = np.array(classif.classify_many(neg[100:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
    (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
    (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
This printed for me:
Confusion matrix:
524	376
202	698
Not perfect, but decent (about 68% accuracy: 1222 of the 1800 test documents classified correctly), considering it's not a super easy problem and it was trained on only 100 documents per class.
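If you want to try the SVM suggestion, one hedged sketch (untested here) is to swap the final pipeline step for scikit-learn's LinearSVC and leave everything else unchanged:

from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same pipeline as above, but with a linear SVM in place of naive Bayes.
# (LinearSVC has no predict_proba, but classify_many only needs predict,
# so this should be a drop-in swap.)
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('svc', LinearSVC())])
classif = SklearnClassifier(pipeline)
# Train and evaluate exactly as in the naive Bayes example above.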