Which classification algorithm can be used for document categorization?

Tags: algorithm, machine-learning, classification, document-classification

Hi, here is my problem:

Given a set of documents I need to assign each document to a predefined category.

I was going to use the n-gram approach to represent the text content of each document and then train an SVM classifier on the training data that I have.
Please correct me if I have misunderstood something.
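
To make the n-gram representation concrete, here is a minimal sketch that counts word n-grams in a document. The function name `word_ngrams`, the whitespace tokenization and the choice of unigrams plus bigrams are assumptions made for illustration, not something fixed by the question.

```python
from collections import Counter

def word_ngrams(text, n_max=2):
    """Count word n-grams of length 1..n_max.

    Assumption: lowercasing and whitespace splitting stand in for a real tokenizer.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

features = word_ngrams("the cat sat on the mat")
```

Each document then becomes a sparse bag of n-gram counts, which is the kind of input an SVM or any of the classifiers discussed below would be trained on.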

The problem now is that the categories should be dynamic, meaning my classifier should handle new training data with a new category.

For example, if I trained a classifier to classify a given document as category A, category B or category C, and was then given new training data with category D, I should be able to incrementally train my classifier by providing it with the new training data for "category D".

To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and train my classifier again. I want to train my classifier on the fly.

Is this possible to implement with an SVM? If not, could you recommend several classification algorithms, or any book/paper that can help me?

Thanks in advance.

asked Aug 20 '12 01:08

TeFa

2 Answers

Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
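
To illustrate why Naive Bayes is incremental, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier that stores only counts. The class name `IncrementalNB`, the whitespace tokenization and the toy documents are assumptions made for this example; it is not a reference implementation.

```python
import math
from collections import Counter, defaultdict

class IncrementalNB:
    """Multinomial Naive Bayes kept as plain counts (illustrative sketch).

    Learning a document only increments counters, so documents from a
    category that was never seen before can be added at any time.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # category -> number of documents
        self.word_counts = defaultdict(Counter)  # category -> token -> count
        self.vocab = set()                       # every token seen so far

    def learn(self, text, category):
        tokens = text.lower().split()            # assumption: whitespace tokenizer
        self.doc_counts[category] += 1
        self.word_counts[category].update(tokens)
        self.vocab.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for cat, n_docs in self.doc_counts.items():
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best

clf = IncrementalNB()
clf.learn("stock market shares fall", "A")
clf.learn("football match goal score", "B")
clf.learn("election vote parliament", "C")
# Later, category D shows up: only the new document is processed.
clf.learn("vaccine virus hospital patients", "D")
label_d = clf.predict("hospital reports new virus patients")
label_b = clf.predict("late goal wins the match")
```

Nothing learned for A, B and C is touched when D arrives, which is the "train on the fly" behaviour asked for in the question.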

Both algorithms are implemented in the open-source project Weka, as NaiveBayes and as IBk (for KNN).
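
Weka itself is a Java library. Purely as an illustration of why KNN is incremental by nature, here is a small Python sketch: training is nothing more than storing labelled examples, so a new category is just another stored example. The class name `IncrementalKNN`, the Jaccard similarity over token sets and the toy documents are assumptions for this example and are not how IBk works internally.

```python
from collections import Counter

def jaccard(a, b):
    """Similarity of two token sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class IncrementalKNN:
    """k-nearest-neighbour text classifier that simply stores its examples."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (token_set, category)

    def learn(self, text, category):
        # "Training" is just appending; nothing is refitted.
        self.examples.append((set(text.lower().split()), category))

    def predict(self, text):
        query = set(text.lower().split())
        ranked = sorted(self.examples,
                        key=lambda ex: jaccard(query, ex[0]),
                        reverse=True)
        votes = Counter(cat for _, cat in ranked[:self.k])
        return votes.most_common(1)[0][0] if votes else None

knn = IncrementalKNN(k=1)
knn.learn("stock market shares fall", "A")
knn.learn("football match goal score", "B")
# A previously unseen category is added without touching older examples.
knn.learn("vaccine virus hospital patients", "D")
knn_label = knn.predict("virus outbreak in hospital")
```

The price of this simplicity is that prediction scans every stored example, so it slows down as the training set grows.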

However, from personal experience, both are vulnerable to a large number of non-informative features (which is usually the case with text classification). Some kind of feature selection is therefore usually used to squeeze better performance out of these algorithms, and that could be problematic to implement incrementally.

answered Oct 16 '22 18:10

amit

This blog post by Edwin Chen describes infinite mixture models to do clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.

answered Oct 16 '22 19:10

jergason
