
Text classification methods: SVM and decision tree?

I have a training set and I want to use a classification method to classify other documents according to it. My documents are news articles, and the categories are sports, politics, economics, and so on.

I understand naive Bayes and KNN completely, but SVM and decision trees are vague to me, and I don't know whether I can implement these methods myself. Or are there existing applications for using these methods?

What is the best method I can use for classifying documents in this way?

Thanks!

zsh asked Jul 02 '13 05:07


2 Answers

Linear SVMs are among the top algorithms for text classification problems (along with logistic regression). Decision trees suffer badly in such high-dimensional feature spaces.

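The Pegasos algorithm is one of the simplest linear SVM algorithms and is incredibly effective. It trains the SVM with stochastic sub-gradient steps on the hinge loss, so it is short enough to implement yourself.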
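To give a sense of how little code Pegasos needs, here is a minimal sketch of a binary linear SVM trained with Pegasos-style updates. It is an illustration under assumptions, not a reference implementation: the function names, the whitespace tokenizer, the labels in {-1, +1}, and the default values of `lam`, `n_iters`, and `seed` are all choices made for this example, and the optional projection step of the original algorithm is left out.

```python
import random
from collections import Counter


def featurize(text):
    """Bag-of-words counts using lowercase whitespace tokens (an assumption)."""
    return Counter(text.lower().split())


def dot(w, x):
    """Sparse dot product between a weight dict and a token-count dict."""
    return sum(w.get(tok, 0.0) * cnt for tok, cnt in x.items())


def pegasos_train(docs, labels, lam=0.01, n_iters=2000, seed=0):
    """Train a linear SVM; docs are token-count dicts, labels are -1 or +1."""
    rng = random.Random(seed)
    w = {}
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(docs))      # pick one training example at random
        x, y = docs[i], labels[i]
        eta = 1.0 / (lam * t)             # step size shrinks as 1 / (lambda * t)
        margin = y * dot(w, x)            # margin under the current weights
        scale = 1.0 - eta * lam           # shrink step from the L2 regularizer
        for tok in w:
            w[tok] *= scale
        if margin < 1.0:                  # hinge loss is active: move toward y * x
            for tok, cnt in x.items():
                w[tok] = w.get(tok, 0.0) + eta * y * cnt
    return w


def predict(w, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if dot(w, x) >= 0 else -1


# Toy usage: +1 = sports, -1 = politics (made-up example data).
train_texts = [
    "team wins match", "striker scores goal late", "coach praises team",
    "parliament passes law", "minister calls election", "senate debates law",
]
train_labels = [1, 1, 1, -1, -1, -1]
weights = pegasos_train([featurize(t) for t in train_texts], train_labels)
print(predict(weights, featurize("team scores goal")))
```

For more than two categories (sports, politics, economics, ...), one common approach is to train one such classifier per category against all the others and pick the category with the highest score.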

EDIT: Multinomial naive Bayes also works well on text data, though usually not as well as linear SVMs. kNN can work okay, but it is already a slow algorithm and it never tops the accuracy charts on text problems.
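As a baseline to compare against, multinomial naive Bayes is also small enough to write by hand. The sketch below assumes whitespace tokenization and add-one (Laplace) smoothing via a hypothetical `alpha` parameter; the function names and the toy documents are invented for illustration. Tokens never seen in training are simply skipped at prediction time.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        token_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for c, n_c in class_docs.items():
        # Denominator: all tokens seen in class c plus the smoothing mass.
        denom = sum(token_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(n_c / len(docs)),
            "log_lik": {
                tok: math.log((token_counts[c][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model


def predict_nb(model, text):
    """Return the class with the highest posterior log score."""
    tokens = text.lower().split()

    def score(c):
        log_lik = model[c]["log_lik"]
        return model[c]["log_prior"] + sum(
            log_lik[t] for t in tokens if t in log_lik
        )

    return max(model, key=score)


# Toy usage with made-up news snippets.
docs = [
    "the team won the match",
    "the striker scored a late goal",
    "parliament passed the new law",
    "the minister won the election",
    "the central bank raised interest rates",
    "inflation and unemployment rates fell",
]
labels = ["sports", "sports", "politics", "politics", "economy", "economy"]
nb_model = train_nb(docs, labels)
print(predict_nb(nb_model, "the bank raised rates"))
```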

Raff.Edward answered Nov 12 '22 11:11


  • Naive Bayes

Though this is the simplest algorithm and it treats all features as independent, in real text classification cases this method works great. I would definitely try this algorithm first.

  • KNN

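KNN (k-nearest neighbors) is a supervised classification method: it labels a document by a vote among its closest labeled neighbors. The similarly named k-means is the clustering algorithm, so take care not to mix up the concepts of clustering and classification.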

  • SVM

SVM comes in two variants: SVC (classification) and SVR (regression). It sometimes works well, but in my experience it performs badly on text classification, because it depends heavily on good tokenizers (filters), and the dictionary of a dataset always contains dirty tokens. The accuracy I got was really bad.

  • Random Forest (decision tree)

I have never tried this method for text classification. I think a decision tree needs a few key nodes, and it is hard to find "several key tokens" for text classification. Random forests also work badly on high-dimensional sparse data.
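The "key tokens" point can be made concrete with information gain, one criterion decision trees commonly use to choose a split. The sketch below assumes a binary split on whether a token is present, whitespace tokenization, and invented function names and toy data; it is meant only to show why a token that occurs in few documents tends to give a small gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, token):
    """Entropy reduction from splitting the documents on presence of `token`."""
    has_token = [token in d.lower().split() for d in docs]
    with_tok = [lab for lab, h in zip(labels, has_token) if h]
    without_tok = [lab for lab, h in zip(labels, has_token) if not h]
    n = len(labels)
    # Weighted entropy of the two branches; empty branches contribute nothing.
    remainder = sum(
        len(part) / n * entropy(part)
        for part in (with_tok, without_tok)
        if part
    )
    return entropy(labels) - remainder


# Toy usage with made-up documents.
docs = [
    "team wins match", "team scores goal",
    "parliament passes law", "minister calls election",
]
labels = ["sports", "sports", "politics", "politics"]
print(information_gain(docs, labels, "team"))        # token covers a whole class
print(information_gain(docs, labels, "parliament"))  # token covers one document
```

In this toy set "team" separates the classes on its own, while "parliament" appears in a single document and removes much less uncertainty. Real vocabularies are dominated by tokens of the second kind.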

FYI

These are all from my experience. For your case, there is no better way to decide which method to use than to try each algorithm and see which one fits your data best.
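Trying each algorithm only tells you something if every candidate is scored the same way on documents it was not trained on. A minimal sketch of k-fold cross-validation follows; the `train_fn`/`predict_fn` interface, the interleaved fold assignment, and the majority-class baseline are assumptions made for this example, not part of any particular library.

```python
from collections import Counter


def k_fold_accuracy(docs, labels, train_fn, predict_fn, k=3):
    """Average accuracy over k folds; every document is tested exactly once."""
    n = len(docs)
    correct = 0
    for fold in range(k):
        # Interleaved folds: documents fold, fold + k, fold + 2k, ... are held out.
        # Shuffle the data beforehand if it is sorted by category.
        test_idx = set(range(fold, n, k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_fn(train_docs, train_labels)
        correct += sum(predict_fn(model, docs[i]) == labels[i] for i in test_idx)
    return correct / n


def train_majority(docs, labels):
    """Baseline 'model': just remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, doc):
    """Always predict the remembered label, ignoring the document."""
    return model


# Toy usage: any classifier worth keeping should beat this baseline.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
labels = ["sports", "sports", "sports", "sports", "sports", "politics"]
print(k_fold_accuracy(docs, labels, train_majority, predict_majority))
```

Any classifier exposed as a train function and a predict function can be passed in place of the baseline, so that all candidates are compared on identical splits.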

Apache Mahout is a great tool for machine learning algorithms. It covers three areas: recommendation, clustering, and classification. You could try this library, but you will have to learn some Hadoop basics first.

For machine learning experiments, Weka is a software toolkit that integrates many algorithms.

Freya Ren answered Nov 12 '22 12:11