How to incrementally train an nltk classifier

Tags:

python

nltk

I am working on a project to classify snippets of text using the Python NLTK module and its NaiveBayesClassifier. I am able to train on corpus data and classify another set of data, but would like to feed additional training information into the classifier after the initial training.

If I'm not mistaken, there doesn't appear to be a way to do this, in that the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset again?
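For context, here is a minimal sketch of the kind of setup I mean (the extract_features helper and the example snippets are just placeholders, not my real featurizer):

```python
import nltk

# Placeholder featurizer -- stands in for whatever feature extraction I use.
def extract_features(snippet):
    return {word: True for word in snippet.lower().split()}

labeled_snippets = [
    ("buy cheap meds now", "spam"),
    ("meeting rescheduled to noon", "ham"),
]

# train() wants the complete training set up front...
train_set = [(extract_features(text), label) for text, label in labeled_snippets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# ...but there is no obvious way to do something like this later:
# classifier.train_more([(extract_features("free money"), "spam")])  # no such method
```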

I'm open to suggestions including other classifiers that can accept new training data over time.

asked Feb 05 '11 by Rog

1 Answer

There are two options that I know of:

1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours, retrain & reload the classifier. This is probably the simplest solution.
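A minimal sketch of what that could look like (the featurizer, the example data, and the retrain schedule are all assumptions, not anything NLTK-specific):

```python
import nltk

# Placeholder featurizer -- swap in whatever you use on your snippets.
def extract_features(text):
    return {word: True for word in text.lower().split()}

def retrain(labeled_texts):
    """Rebuild the classifier from the full corpus (original + new examples)."""
    train_set = [(extract_features(text), label) for text, label in labeled_texts]
    return nltk.NaiveBayesClassifier.train(train_set)

# Original training data plus whatever you've accumulated since.
corpus = [("buy cheap meds", "spam"), ("meeting at noon", "ham")]
classifier = retrain(corpus)

# Later: append the new labeled examples, then retrain and reload on a schedule.
corpus.append(("free money now", "spam"))
classifier = retrain(corpus)
```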

2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be constructed directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them into a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use the new data immediately. You'd have to look at the train method for the details of how those probability distributions are built.
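A rough sketch of that approach, using the same ELEProbDist estimator that train uses by default and the Counter-style FreqDist interface of recent NLTK versions. It only loosely mirrors what train does (in particular it skips train's handling of features that are missing from a given example):

```python
from collections import defaultdict
from nltk import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.probability import ELEProbDist

# Keep the raw frequency counts yourself so they can be updated later.
label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)

def update_counts(featureset, label):
    """Fold one labeled featureset into the running counts."""
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        feature_freqdist[label, fname][fval] += 1
        feature_values[fname].add(fval)

def build_classifier():
    """Rebuild the probability distributions from the current counts."""
    label_probdist = ELEProbDist(label_freqdist)
    feature_probdist = {
        (label, fname): ELEProbDist(freqdist, bins=len(feature_values[fname]))
        for (label, fname), freqdist in feature_freqdist.items()
    }
    return NaiveBayesClassifier(label_probdist, feature_probdist)

# Initial training data.
update_counts({"contains(money)": True}, "spam")
update_counts({"contains(hello)": True}, "ham")
classifier = build_classifier()

# A new example arrives later: update the counts, rebuild the probdists.
update_counts({"contains(free)": True}, "spam")
classifier = build_classifier()
```

Rebuilding the probdists from the running counts is cheap compared with re-featurizing and retraining on the whole corpus, which is the point of keeping the frequency distributions around.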

answered Oct 20 '22 by Jacob