I'm an NLTK/Python beginner and managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train a classifier on the data and use it to classify text?
>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')
>>> len(reader.categories())
234
Assuming you want a naive Bayes classifier with bag of words features:
from nltk import FreqDist
from nltk.classify.naivebayes import NaiveBayesClassifier

def make_training_data(rdr):
    # Yield one (word counts, category) pair per file in the corpus.
    for c in rdr.categories():
        for f in rdr.fileids(c):
            yield FreqDist(rdr.words(fileids=[f])), c

clf = NaiveBayesClassifier.train(list(make_training_data(reader)))
The resulting clf's classify method can be used on any FreqDist of words.
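To classify new text, the input has to be turned into the same kind of FreqDist features that were used for training. Here is a minimal sketch of that step. It uses a toy in-memory training set in place of the real corpus, and the categories, sentences, and the classify_text helper are made up for illustration:

```python
from nltk import FreqDist
from nltk.classify.naivebayes import NaiveBayesClassifier

# Toy training set standing in for the corpus (hypothetical data).
train = [
    (FreqDist("the match ended with a late goal".split()), "sport"),
    (FreqDist("the team won the cup final".split()), "sport"),
    (FreqDist("the market fell as shares dropped".split()), "finance"),
    (FreqDist("the bank raised interest rates".split()), "finance"),
]
clf = NaiveBayesClassifier.train(train)

def classify_text(classifier, text):
    """Build bag-of-words counts for the text and return the predicted label."""
    # Naive whitespace tokenization; use the same tokenization as in training.
    features = FreqDist(text.lower().split())
    return classifier.classify(features)

label = classify_text(clf, "shares dropped after the bank raised rates")

# prob_classify returns a probability distribution over all labels.
dist = clf.prob_classify(FreqDist("the team won".split()))
probs = {lbl: dist.prob(lbl) for lbl in dist.samples()}
```

With the real corpus you would probably tokenize the new text the same way the reader does (for example with nltk.word_tokenize), so that the feature names match those seen in training.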
(But note: from your cat_pattern, it seems you have one sample and a single category per file in your corpus. Please check whether that's really what you want.)
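To see why the pattern gives one category per file: as far as I know, the reader matches cat_pattern against each file id and takes the first group as the category. The sketch below imitates that with plain re. The file names and the second, prefix-based pattern are hypothetical, only there to show the difference:

```python
import re

# Pattern from the question: the whole file name (minus .txt) becomes the category.
pattern_per_file = r'(.*)\.txt'
# Hypothetical alternative: the category is the prefix before the first underscore.
pattern_prefix = r'([a-z]+)_.*\.txt'

def category_of(pattern, fileid):
    """Return the first group of the pattern matched against the file id, or None."""
    m = re.match(pattern, fileid)
    return m.group(1) if m else None

# Hypothetical file ids for illustration.
fileids = ['sport_001.txt', 'sport_002.txt', 'finance_001.txt']

per_file = {f: category_of(pattern_per_file, f) for f in fileids}
by_prefix = {f: category_of(pattern_prefix, f) for f in fileids}

print(sorted(set(per_file.values())))   # one category per file
print(sorted(set(by_prefix.values())))  # files grouped into shared categories
```

If every category has only one training file, the classifier has very little to learn from, so a pattern that groups several files under one label is usually what you want.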