Classifying Documents into Categories

Tags: python, machine-learning, naivebayes, nlp, nltk

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.

I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).

My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).

Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier, just in case the document doesn't fit into any of the categories?

Here is my test class http://gist.github.com/451880

asked Jun 24 '10 19:06

erikcw

2 Answers

You should start by converting your documents into TF-log(1 + IDF) vectors. Term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, and then divide by the total count to turn the counts into frequencies.
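A minimal sketch of that first step, using only the standard library. The helper names (`term_frequencies`, `inverse_document_frequencies`, `tfidf_vector`) and the toy documents are made up for illustration, and IDF is assumed here to mean N / df (number of documents divided by the number of documents containing the term); other definitions are common.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Map each term to its relative frequency within one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def inverse_document_frequencies(tokenized_docs):
    """Map each term to N / df (assumed IDF definition)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: n_docs / df for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Weight each term frequency by log(1 + IDF); unseen terms are skipped."""
    tf = term_frequencies(tokens)
    return {t: f * math.log(1.0 + idf[t]) for t, f in tf.items() if t in idf}

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell as markets closed".split(),
]
idf = inverse_document_frequencies(docs)
vectors = [tfidf_vector(d, idf) for d in docs]
```

Each vector only stores the terms that actually occur in the document, which is what keeps memory use low compared with a dense representation.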

Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a Python dict.
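Here is a rough sketch of the hashing idea with scipy.sparse. One caveat: in Python 3 the built-in hash() of a string changes between interpreter runs unless PYTHONHASHSEED is fixed, so this sketch swaps in zlib.crc32 to get stable indices. `N_FEATURES`, `term_index` and `docs_to_sparse` are hypothetical names, and the feature-space size is an arbitrary assumption.

```python
import zlib
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

N_FEATURES = 2 ** 18  # assumed size of the hashed feature space

def term_index(term, n_features=N_FEATURES):
    """Stable non-negative column index for a term (crc32 instead of hash())."""
    return zlib.crc32(term.encode("utf-8")) % n_features

def docs_to_sparse(tokenized_docs, n_features=N_FEATURES):
    """Build an (n_docs x n_features) CSR matrix of relative term frequencies."""
    rows, cols, vals = [], [], []
    for row, tokens in enumerate(tokenized_docs):
        # Terms that collide on the same index are simply counted together.
        counts = Counter(term_index(t, n_features) for t in tokens)
        total = sum(counts.values())
        for col, n in counts.items():
            rows.append(row)
            cols.append(col)
            vals.append(n / total)
    return csr_matrix(
        (vals, (rows, cols)),
        shape=(len(tokenized_docs), n_features),
        dtype=np.float64,
    )

X = docs_to_sparse([["a", "b", "a"], ["c"]], n_features=1024)
```

A larger feature space means fewer collisions at the cost of wider (but still sparse) rows.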

Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for a new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as the label for your document.
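The centroid approach could look roughly like this with dict-based vectors. The function names are hypothetical. The `min_similarity` parameter is not part of the description above; it is an assumption added as one possible way to get the "none of the above" outcome the question asks about.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def category_centroids(vectors, labels):
    """Average the vectors of all labeled documents in each category."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {t: w / counts[label] for t, w in terms.items()}
        for label, terms in sums.items()
    }

def classify(vec, centroids, min_similarity=0.0):
    """Return (label, similarity); label is None below min_similarity.

    Assumes centroids is non-empty. min_similarity is an added assumption,
    not something the approach requires.
    """
    scored = [(cosine(vec, c), label) for label, c in centroids.items()]
    best_sim, best_label = max(scored, key=lambda pair: pair[0])
    if best_sim < min_similarity:
        return None, best_sim
    return best_label, best_sim

# Toy training vectors, invented for illustration.
train = [
    {"cat": 1.0, "pet": 1.0},
    {"cat": 1.0, "dog": 1.0},
    {"stock": 1.0, "market": 1.0},
]
labels = ["animals", "animals", "finance"]
centroids = category_centroids(train, labels)
label, similarity = classify({"dog": 1.0, "cat": 1.0}, centroids)
```

With 150 categories this needs only 150 centroid vectors in memory at classification time, regardless of how many labeled documents went into them.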

If this is not good enough, then you should try to train a logistic regression model using an L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear, as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.
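A small sketch of what that might look like, assuming scikit-learn is installed. The toy texts and the value of C are invented. TfidfVectorizer is used as a stand-in for the TF-log(1 + IDF) vectors; its IDF formula is similar in spirit but not identical. The `penalty="l1"` and `solver="liblinear"` arguments are the long-standing spelling; check the documentation for your installed version, since parameter names have shifted between releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline

# Toy labeled corpus, invented for illustration.
train_texts = [
    "the cat chased the dog",
    "a dog and a cat play",
    "cat food and dog food",
    "my cat ignores my dog",
    "stock markets fell today",
    "the stock market rallied",
    "investors sold stock in falling markets",
    "stock prices and bond markets",
]
train_labels = ["animals"] * 4 + ["finance"] * 4
test_texts = ["the dog and the cat sleep", "stock prices fell"]
test_labels = ["animals", "finance"]

# TF-IDF features followed by L1-penalized logistic regression (liblinear).
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
model.fit(train_texts, train_labels)
predicted = model.predict(test_texts)

# Macro-averaged scores from sklearn.metrics.
precision = precision_score(test_labels, predicted, average="macro", zero_division=0)
recall = recall_score(test_labels, predicted, average="macro", zero_division=0)
print(list(predicted), precision, recall)
```

The L1 penalty drives most feature weights to exactly zero, which keeps the trained model small even with a large vocabulary.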

For larger datasets, you should try Vowpal Wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (but there are no easy-to-use Python wrappers, AFAIK).

answered Oct 05 '22 00:10

ogrisel

How big (number of words) are your documents? Memory consumption at 150K training docs should not be an issue.

Naive Bayes is a good choice, especially when you have many categories with only a few training examples or very noisy training data. But in general, linear Support Vector Machines perform much better.

Is your problem multiclass (a document belongs to exactly one category) or multilabel (a document belongs to one or more categories)?

Accuracy is a poor choice for judging classifier performance. You should rather use precision vs. recall, the precision-recall breakeven point (PRBP), F1 or AUC, and look at the precision vs. recall curve, where recall (x) is plotted against precision (y) as you vary your confidence threshold (whether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs. all other training examples that don't belong to that category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro-average (sum up all true positives, false positives, false negatives and true negatives and calculate combined scores) or macro-average (calculate the score per category and then average those scores over all categories).
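To make the micro vs. macro distinction concrete, here is a small standard-library sketch. The per-category counts are made up, and the function names are hypothetical. True negatives are left out because precision, recall and F1 don't use them.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_average(counts):
    """Score each category separately, then average the scores."""
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

def micro_average(counts):
    """Sum the raw counts over all categories, then score once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)

# Invented (tp, fp, fn) counts: one large easy category, one small hard one.
counts = {"sports": (90, 10, 10), "politics": (5, 5, 15)}
macro = macro_average(counts)
micro = micro_average(counts)
```

With these numbers the macro-averaged F1 comes out lower than the micro-averaged one, because macro-averaging gives the small, poorly classified category the same weight as the large one.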

We have a corpus of tens of millions of documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents that are added, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems, using one of the Python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.

answered Oct 05 '22 02:10

ephes
