I'm new to text categorization techniques. I want to know the difference between the n-gram approach to text categorization and classifier-based text categorization (decision trees, KNN, SVM).
Which one is better? Do n-grams count as classifiers? Do n-grams overcome any shortcomings of the classifier techniques?
Where can I get comparative information on all these techniques?
Thanks in advance.
I'll actually post a full answer to this, since I think it's worth it being obvious that you can use n-gram models as classifiers (in much the same way as you can use any probability model of your features as one).
Generative classifiers approximate the posterior of interest, p(class | test doc) as:
p(c | t) ∝ p(c) p(t | c)
where p(c) is the prior probability of c and p(t|c) is the likelihood. Classification picks the arg-max over all c. An n-gram language model, just like Naive Bayes or LDA or whatever generative model you like, can be construed as a probability model p(t|c) if you estimate a separate model for each class. As such, it can provide all the information required to do classification.
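To make this concrete, here is a minimal sketch of that recipe: estimate one add-one-smoothed word bigram language model per class, then classify a test document by the arg-max of log p(c) + log p(t|c). The toy corpus, class names, and function names are all made up for illustration; a real system would need far more data and better smoothing, as discussed below.

```python
from collections import Counter
from math import log

def train_bigram_lm(docs):
    """Estimate an add-one-smoothed word bigram model from a list of token lists.

    Returns a function mapping a token list to its log-likelihood log p(t|c)."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for tokens in docs:
        padded = ["<s>"] + tokens          # sentence-start symbol
        vocab.update(padded)
        unigrams.update(padded[:-1])       # bigram history counts
        bigrams.update(zip(padded[:-1], padded[1:]))
    V = len(vocab)
    def log_prob(tokens):
        padded = ["<s>"] + tokens
        return sum(log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(padded[:-1], padded[1:]))
    return log_prob

def classify(test_tokens, class_models, priors):
    """arg-max over classes c of log p(c) + log p(t|c)."""
    return max(class_models,
               key=lambda c: log(priors[c]) + class_models[c](test_tokens))

# Hypothetical toy training corpus (two classes, two tiny documents each).
train = {
    "sports": [["the", "team", "won", "the", "game"],
               ["the", "player", "scored"]],
    "tech":   [["the", "code", "compiled"],
               ["the", "server", "crashed", "again"]],
}
models = {c: train_bigram_lm(docs) for c, docs in train.items()}
total = sum(len(docs) for docs in train.values())
priors = {c: len(docs) / total for c, docs in train.items()}

print(classify(["the", "team", "scored"], models, priors))  # → sports
```

The per-class model here plays exactly the role of p(t|c) in the formula above; swapping the bigram model for any other generative model of text gives a different classifier with the same decision rule.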
The question is whether the model is any use, of course. The major issue is that n-gram models tend to be built over billions of words of text, whereas classifiers are often trained on a few thousand. You can do complicated stuff like putting joint priors on the parameters of all the classes' models, clamping hyperparameters to be equal (what these parameters are depends on how you do smoothing)... but it's still tricky.
An alternative is to build an n-gram model over characters (including spaces/punctuation if that turns out to be useful). This can be estimated much more reliably (26^3 parameters for a trigram model instead of ~20000^3), and can be very useful for author identification, genre classification, and other forms of classification with stylistic elements.
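A character-level version of the same idea can be sketched in a few lines. The 27-symbol alphabet (lowercase letters plus space), the add-one smoothing, and the function names are illustrative choices, not a prescribed recipe:

```python
from collections import Counter
from math import log
import string

ALPHABET = set(string.ascii_lowercase + " ")  # 27 symbols incl. space

def char_trigram_lm(train_text):
    """Add-one-smoothed character trigram model over a small alphabet.

    Returns a function scoring a string by its trigram log-likelihood."""
    text = "".join(ch for ch in train_text.lower() if ch in ALPHABET)
    tri = Counter(text[i:i + 3] for i in range(len(text) - 2))
    bi = Counter(text[i:i + 2] for i in range(len(text) - 1))
    V = len(ALPHABET)
    def score(s):
        s = "".join(ch for ch in s.lower() if ch in ALPHABET)
        return sum(log((tri[s[i:i + 3]] + 1) / (bi[s[i:i + 2]] + V))
                   for i in range(len(s) - 2))
    return score
```

For author identification you would train one such model per author and pick the author whose model scores the disputed text highest; strings resembling the training text score higher than strings that don't.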
An n-gram model is not a classifier; it is a probabilistic language model over sequences of basic units, where those units can be words, phonemes, letters, etc. An n-gram model is basically a probability distribution over sequences of length n, and it can be used when building a representation of a text.
A classifier is an algorithm, which may or may not use n-grams to represent the texts it classifies.
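As a sketch of that second role, n-grams can serve purely as features: represent each text as a bag of character n-grams and feed those vectors to any classifier you like (KNN with cosine similarity, an SVM, etc.). The helper names here are hypothetical:

```python
from collections import Counter
from math import sqrt

def ngram_features(text, n=2):
    """Represent a text as a bag of character n-grams (a sparse count vector)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

A nearest-neighbour classifier, for instance, would label a new text with the class of the training text whose n-gram vector has the highest cosine similarity; the n-grams provide the representation, while the classification logic is entirely separate.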