 

NLP software for classification of large datasets

Tags: nlp, nltk

Background

For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.

There are three types of categorization done for each item:

  1. 30 categories, where each item must belong to at least one category and at most two.
  2. 10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
  3. 4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
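
To make the rules concrete, here is a rough Python sketch of how they could be applied to per-category scores from a classifier; the thresholds, default label, and example scores below are illustrative placeholders, not my real values:

    # Illustrative only: post-processing per-category scores (e.g. probabilities
    # from a classifier) into the three assignment rules described above.

    def top_one_or_two(scores, second_margin=0.3):
        """Rule 1: at least one category, at most two (the best-scoring ones)."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        picks = ranked[:1]
        if len(ranked) > 1 and scores[ranked[1]] >= second_margin:
            picks.append(ranked[1])
        return picks

    def strong_matches(scores, threshold=0.8):
        """Rule 2: every category whose score counts as a strong match."""
        return [c for c, s in scores.items() if s >= threshold]

    def single_or_default(scores, threshold=0.6, default="uncategorized"):
        """Rule 3: exactly one category, falling back to a default."""
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else default

    print(top_one_or_two({"news": 0.7, "sports": 0.4, "weather": 0.1}))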

Each item consists of around 2,000 characters of English text. My training dataset contains about 265,000 items, which together contain roughly 10,000,000 unique features (three-word phrases).
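
For illustration, features of this kind could be extracted with something like the following sketch (tokenization is simplified here and this is not my exact extractor):

    # Illustrative only: build the set of unique three-word phrases (trigrams)
    # for one item, as a sparse binary feature set.
    import re

    def trigram_features(text):
        """Return the set of unique three-word phrases in the text."""
        words = re.findall(r"[a-z']+", text.lower())
        return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

    print(trigram_features("each item consists of English text of around 2,000 characters"))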

My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.

The Question

What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?

Those I've tried so far:

  • NLTK
  • TIMBL

I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
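
For concreteness, one common sparse binary encoding for NLTK is the dict-of-True form below; the items, labels, and trigram helper (from the sketch above) are placeholders for illustration:

    # Illustrative only: each item becomes a dict mapping only the features it
    # actually has to True, and the (featureset, label) pairs are fed to NLTK.
    # trigram_features is the helper sketched earlier; the items are made up.
    import nltk

    train_items = [
        ("text of the first training item ...", "category_a"),
        ("text of the second training item ...", "category_b"),
    ]

    train_set = [(dict.fromkeys(trigram_features(text), True), label)
                 for text, label in train_items]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(dict.fromkeys(trigram_features("a new unseen item"), True)))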

Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.

I've looked at Google's Prediction API, which seems to do much of what I'm looking for, but not everything. I'd also like to avoid relying on an external service if possible.

About the choice of features: in testing with my homebrew methods over the years, three-word phrases have produced by far the best results. Although I could reduce the number of features by using single words or two-word phrases, that would most likely produce inferior results, and it would still leave a large number of features.

Asked Aug 30 '11 by Animism


1 Answer

Following this post, and based on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
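
To make that concrete: Vowpal Wabbit trains out of core from a plain-text file, so memory use stays bounded even with millions of features. The file names below are made up and the flags are only a reasonable starting point, not a tuned setup:

    # One line per item: "<label> | <features>". Multi-word phrases need to be
    # collapsed into single tokens, e.g. joined with underscores:
    #   7 | new_items_from items_from_external external_sources_based ...

    # Train a 30-class one-against-all model, streaming the data from disk
    # (labels must be the integers 1..30 for --oaa):
    vw --oaa 30 train.vw -f model.vw -b 28 --passes 3 --cache_file train.cache

    # Classify new items with the saved model:
    vw -t -i model.vw test.vw -p predictions.txt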

Answered Oct 01 '22 by Skarab