N-grams vs other classifiers in text categorization

I'm new to text categorization techniques. I want to know the difference between the N-gram approach to text categorization and text categorization based on other classifiers (decision trees, KNN, SVM).

I want to know which one is better. Do n-grams count as classifiers? Do n-grams overcome any shortcomings of the classifier techniques?

Where can I get comparative information on all these techniques?

Thanks in advance.

asked Dec 01 '13 by wudpecker


2 Answers

I'll actually post a full answer to this, since I think it's worth making it explicit that you can use n-gram models as classifiers (in much the same way as you can use any probability model of your features as one).

Generative classifiers approximate the posterior of interest, p(class | test doc) as:

p(c|t) \propto p(c) p(t|c)

where p(c) is the prior probability of c and p(t|c) is the likelihood. Classification picks the arg-max over all c. An n-gram language model, just like Naive Bayes or LDA or whatever generative model you like, can be construed as a probability model p(t|c) if you estimate a separate model for each class. As such, it can provide all the information required to do classification.
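Spelled out, the likelihood term above factorises under an n-gram model (this just expands the formula already given, with one model fitted per class):

p(t|c) = \prod_i p(w_i | w_{i-n+1}, ..., w_{i-1}, c)

and the predicted class is argmax_c p(c) p(t|c).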

The question, of course, is whether the model is any use. The major issue is that n-gram models tend to be built over billions of words of text, whereas classifiers are often trained on a few thousand documents. You can do complicated things like putting joint priors on the parameters of all the classes' models and clamping hyperparameters to be equal (what these parameters are depends on how you do smoothing)... but it's still tricky.

An alternative is to build an n-gram model over characters (including spaces/punctuation if that turns out to be useful). This can be estimated much more reliably (26^3 parameters for a tri-gram model instead of ~20000^3), and it can be very useful for author identification, genre classification and other forms of classification with stylistic elements.
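As a minimal sketch of that recipe (my own code, with illustrative names, not taken from any library): one character-trigram model per class, add-one smoothing, and classification by arg-max of log p(c) + log p(t|c).

    # Sketch only: per-class character-trigram language models used as a classifier.
    # Class labels, smoothing choice and padding convention are illustrative assumptions.
    from collections import Counter
    import math

    class CharTrigramClassifier:
        def __init__(self):
            self.trigram_counts = {}   # class -> Counter of (context, char) trigrams
            self.context_counts = {}   # class -> Counter of 2-character contexts
            self.class_counts = Counter()
            self.vocab = set()

        def fit(self, texts, labels):
            for text, c in zip(texts, labels):
                self.class_counts[c] += 1
                tri = self.trigram_counts.setdefault(c, Counter())
                ctx = self.context_counts.setdefault(c, Counter())
                padded = "  " + text.lower()          # pad so the first char has a context
                for i in range(len(padded) - 2):
                    context, ch = padded[i:i + 2], padded[i + 2]
                    tri[(context, ch)] += 1
                    ctx[context] += 1
                    self.vocab.add(ch)

        def _log_likelihood(self, text, c):
            tri, ctx = self.trigram_counts[c], self.context_counts[c]
            V = len(self.vocab)
            padded = "  " + text.lower()
            ll = 0.0
            for i in range(len(padded) - 2):
                context, ch = padded[i:i + 2], padded[i + 2]
                # add-one smoothing over the character vocabulary
                ll += math.log((tri[(context, ch)] + 1) / (ctx[context] + V))
            return ll

        def predict(self, text):
            total = sum(self.class_counts.values())
            # arg-max over classes of log prior + log likelihood
            return max(
                self.class_counts,
                key=lambda c: math.log(self.class_counts[c] / total)
                              + self._log_likelihood(text, c),
            )

With enough training text per class, the same structure works for author identification or genre classification as described above; only the labels change.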

answered Oct 16 '22 by Ben Allison


An n-gram model is not a classifier; it is a probabilistic language model that models sequences of basic units, where these basic units can be words, phonemes, letters, etc. An n-gram model is basically a probability distribution over sequences of length n, and it can be used when building a representation of a text.

A classifier is an algorithm that may or may not use n-grams for its representation of texts; a sketch of the latter follows.
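For example, a separate classifier can consume character n-gram counts as its text representation. A minimal sketch, assuming scikit-learn (the library choice and the toy data are mine, not the answer's):

    # Sketch: n-grams as features for a classifier, not as the classifier itself.
    # scikit-learn and the toy documents/labels below are illustrative assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ["the cat sat on the mat", "stocks rallied on strong earnings"]
    labels = ["pets", "finance"]

    # Represent each text by its character 1- to 3-gram counts...
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
    X = vectorizer.fit_transform(docs)

    # ...and feed that representation to a classifier (here, a linear SVM).
    classifier = LinearSVC()
    classifier.fit(X, labels)
    print(classifier.predict(vectorizer.transform(["the dog sat on the rug"])))

The same vectorizer output could just as well be fed to a decision tree or KNN; the n-grams are only the representation.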

answered Oct 16 '22 by Itamar Katz