I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmaticly categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categoies/300k documents at once (training on 5 categories used 8GB). Furthermore, accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into on by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880
Typically, there are four classifications for data: public, internal-only, confidential, and restricted.
Document classification is the act of labeling – or tagging – documents using categories, depending on their content. Document classification can be manual (as it is in library science) or automated (within the field of computer science), and is used to easily sort and manage texts, images or videos.
It is concluded that KNN classifiers have been recognized as the best algorithm for document classification with a percentage accuracy of 99.85%, recall value of 100%, and f-Score of 0.997.
You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse so you should use python dict with term as keys and count as values and then divide by total count to get the global frequencies.
Another solution is to use the abs(hash(term)) for instance as positive integer keys. Then you an use scipy.sparse vectors which are more handy and more efficient to perform linear algebra operation than python dict.
Also build the 150 frequencies vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then for new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as label for your document.
If this is not good enough, then you should try to train a logistic regression model using a L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TD-log(1+IDF) vectors to get good performance (precision and recall). The scikit learn lib offers a sklearn.metrics module with routines to compute those score for a given model and given dataset.
For larger datasets: you should try the vowpal wabbit which is probably the fastest rabbit on earth for large scale document classification problems (but not easy to use python wrappers AFAIK).
How big (number of words) are your documents? Memory consumption at 150K trainingdocs should not be an issue.
Naive Bayes is a good choice especially when you have many categories with only a few training examples or very noisy trainingdata. But in general, linear Support Vector Machines do perform much better.
Is your problem multiclass (a document belongs only to one category exclusivly) or multilabel (a document belongs to one or more categories)?
Accuracy is a poor choice to judge classifier performance. You should rather use precision vs recall, precision recall breakeven point (prbp), f1, auc and have to look at the precision vs recall curve where recall (x) is plotted against precision (y) based on the value of your confidence-threshold (wether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs all other trainingexamples which don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro (sum up all true positives, false positives, false negatives and true negatives and calc combined scores) or macro (calc score per category and then average those scores over all categories) average.
We have a corpus of tens of million documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents are new, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems using one of the python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With