
Which classification algorithm can be used for document categorization?

Hey, here is my problem:

Given a set of documents I need to assign each document to a predefined category.

I was going to use the n-gram approach to represent the text content of each document and then train an SVM classifier on the training data that I have.
Correct me if I misunderstood something, please.
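
For reference, the n-gram representation mentioned above can be sketched roughly as follows. This is only a minimal illustration using word bigrams and whitespace tokenization; the function name `word_ngrams` and the sample sentence are made up for the example, and a real pipeline would likely need better tokenization.

```python
from collections import Counter

def word_ngrams(text, n=2):
    """Return a Counter mapping each word n-gram (a tuple) to its frequency."""
    # Assumption: lowercase + whitespace split is good enough for a sketch.
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample document, used only to show the shape of the output.
features = word_ngrams("the quick brown fox jumps over the quick brown dog", n=2)
print(features.most_common(2))
```

The resulting counts can serve as a sparse feature vector for whichever classifier ends up being used.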

The problem now is that the categories should be dynamic, meaning my classifier should handle new training data with a new category.

So, for example, if I trained a classifier to classify a given document as category A, category B or category C, and was then given new training data with category D, I should be able to incrementally train my classifier by providing it with the new training data for "category D".

To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and train my classifier again. I want to train my classifier on the fly.

Is this possible to implement with SVM? If not, could you recommend several classification algorithms, or any book/paper that can help me?

Thanks in advance.

TeFa asked Aug 20 '12 01:08




2 Answers

Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
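
To illustrate why Naive Bayes is incremental, here is a minimal sketch of a multinomial Naive Bayes classifier that keeps only counts, so adding a document (or a whole new category) is just a count update. The class name `IncrementalNaiveBayes`, the Laplace smoothing with `alpha=1.0`, and the toy documents are all assumptions made for this example; it is not Weka's implementation.

```python
import math
from collections import Counter, defaultdict

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes that stores only counts, so it can be updated
    one document at a time and can accept labels it has never seen."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.total_words = Counter()             # label -> total token count
        self.vocabulary = set()                  # all tokens seen so far

    def update(self, tokens, label):
        """Add one labelled document; an unseen label just creates new entries."""
        self.doc_counts[label] += 1
        for token in tokens:
            self.word_counts[label][token] += 1
            self.total_words[label] += 1
            self.vocabulary.add(token)

    def predict(self, tokens):
        """Return the label with the highest log posterior for the tokens."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = self.total_words[label] + self.alpha * vocab_size
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy data (hypothetical): train on categories A, B and C first.
clf = IncrementalNaiveBayes()
clf.update("stock market shares price".split(), "A")
clf.update("football match goal team".split(), "B")
clf.update("election vote parliament party".split(), "C")

# Later, category D arrives; the A/B/C documents are not revisited.
clf.update("recipe oven flour sugar".split(), "D")
print(clf.predict("flour and sugar in the oven".split()))
```

The tokens only need to be hashable, so n-gram tuples could be passed in place of single words. An incremental KNN would be simpler still, since it only has to append each new labelled vector to its stored examples.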

Both algorithms are implemented in the open source project Weka as NaiveBayes and IBk for KNN.

However, from personal experience, both are vulnerable to a large number of non-informative features (which is usually the case with text classification), so some kind of feature selection is usually used to squeeze better performance out of these algorithms, and that could be problematic to implement incrementally.

amit answered Oct 16 '22 18:10


This blog post by Edwin Chen describes using infinite mixture models for clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head around it.

jergason answered Oct 16 '22 19:10