Fine Text Classification - what algorithm?

I'm looking to implement a classifier with approximately 150 categories (probably in Java), mostly for tweets (so very small documents). Some of the classes have very similar domains, e.g. 'Companies', 'Competition', 'Consumers', 'International law', 'International organisations', 'International politics and government'. What algorithm or approach is best when such a high resolution is needed? I've tried Naive Bayes (obviously), and so far it hasn't performed very well (although that could just be due to the quality of the training data). The community's thoughts would be very welcome!

Thanks,

Mark

Mark asked Jan 13 '23 12:01



1 Answer

It might be worthwhile to build a hierarchical classifier from (potentially many) levels of sub-classifiers, i.e., to define a taxonomy for your document labels.

Single classifier

[Figure: a single classifier with many possible class labels]

A single classifier can output any of the many possible class labels.

Hierarchical classifier

[Figure: a hierarchical classifier]

A hierarchical classifier groups related class labels together, and performs additional layers of classification until a leaf node is reached (or until the confidence drops below a certain threshold).
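To make the top-down routing concrete, here is a minimal Java sketch. It is only an illustration under assumptions: the class and method names are hypothetical, the groups 'International' and 'Business' are invented parents for some of the labels in the question, and the keyword tests with fixed confidences (0.9 and 0.4) are toy stand-ins for whatever trained sub-classifier you would actually use at each node.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of top-down hierarchical classification. All names are hypothetical.
class HierarchicalClassifierSketch {

    // A predicted child label plus the classifier's confidence in it.
    static final class Prediction {
        final String label;
        final double confidence;

        Prediction(String label, double confidence) {
            this.label = label;
            this.confidence = confidence;
        }
    }

    // One taxonomy node. Inner nodes hold a classifier that picks a child.
    static final class Node {
        final String label;
        final Function<String, Prediction> classifier; // null for leaves
        final Map<String, Node> children = new HashMap<>();

        Node(String label, Function<String, Prediction> classifier) {
            this.label = label;
            this.classifier = classifier;
        }

        Node add(Node child) {
            children.put(child.label, child);
            return this;
        }
    }

    // Walk down until a leaf is reached or confidence falls below the threshold.
    static String classify(Node root, String text, double minConfidence) {
        Node current = root;
        while (!current.children.isEmpty()) {
            Prediction p = current.classifier.apply(text);
            Node next = current.children.get(p.label);
            if (next == null || p.confidence < minConfidence) {
                break; // keep the coarser label we already have
            }
            current = next;
        }
        return current.label;
    }

    // Toy stand-in for a trained model: one keyword test with made-up confidences.
    static Function<String, Prediction> keyword(String word, String ifPresent, String ifAbsent) {
        return text -> text.toLowerCase().contains(word)
                ? new Prediction(ifPresent, 0.9)
                : new Prediction(ifAbsent, 0.4);
    }

    // Assumed two-level taxonomy; the parent groups are invented for this example.
    static Node buildDemoTaxonomy() {
        Node international = new Node("International",
                keyword("court", "International law", "International politics and government"))
                .add(new Node("International law", null))
                .add(new Node("International politics and government", null));
        Node business = new Node("Business", keyword("customer", "Consumers", "Companies"))
                .add(new Node("Consumers", null))
                .add(new Node("Companies", null));
        return new Node("(root)", keyword("international", "International", "Business"))
                .add(international)
                .add(business);
    }

    public static void main(String[] args) {
        Node root = buildDemoTaxonomy();
        System.out.println(classify(root, "International court ruling expected", 0.5));
        System.out.println(classify(root, "International summit opens today", 0.5));
    }
}
```

With a threshold of 0.5, the first tweet is routed all the way to the leaf 'International law', while the second stops at the coarser 'International' node because the second-level prediction is not confident enough. Returning the coarser label in that case is one possible design choice; you could also fall back to a flat classifier or reject the tweet.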

Intuition

The intuition is that each classifier will have an easier time learning discriminative features when it has fewer categories to choose between.

For example, a hierarchical classifier may have an easier time learning that the word 'player' is a good feature indicative of 'sports', whereas a single classifier would have a more difficult time if 'player' was only seen in training for one category ('basketball') and not another ('hockey').
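The 'player' example can be checked with simple counts. The sketch below is an assumption-laden illustration, not a classifier: the four training tweets, the label names, and the class and method names are all made up. It only counts how many training texts contain a word under each label, first with the flat leaf labels and then with the leaf labels grouped under a parent.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: how grouping leaf labels pools the evidence for a word. Data is made up.
class LabelPoolingSketch {

    // Counts, per label, how many training texts contain the word.
    // leafToParent maps a leaf label to its group; unmapped labels stay as they are.
    static Map<String, Integer> countWordByLabel(List<String[]> examples, String word,
                                                 Map<String, String> leafToParent) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] example : examples) {
            String text = example[0];
            String label = leafToParent.getOrDefault(example[1], example[1]);
            boolean containsWord =
                    Arrays.asList(text.toLowerCase().split("\\W+")).contains(word);
            if (containsWord) {
                counts.merge(label, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Hypothetical training set: each entry is {text, leaf label}.
    static List<String[]> demoExamples() {
        return List.of(
                new String[] {"Star player scores 40 points", "basketball"},
                new String[] {"Player traded before the playoffs", "basketball"},
                new String[] {"Goalie stops every puck tonight", "hockey"},
                new String[] {"Parliament votes on the budget", "politics"});
    }

    public static void main(String[] args) {
        Map<String, String> groups = Map.of("basketball", "sports", "hockey", "sports");
        // Flat labels: 'player' is only ever seen under 'basketball'.
        System.out.println(countWordByLabel(demoExamples(), "player", Map.of()));
        // Grouped labels: the same evidence now counts for 'sports' as a whole.
        System.out.println(countWordByLabel(demoExamples(), "player", groups));
    }
}
```

With flat labels, 'hockey' has no evidence for 'player' at all, so a new hockey tweet containing that word gets pulled towards 'basketball'. With grouped labels, the top-level classifier sees 'player' as evidence for 'sports', and the choice between 'basketball' and 'hockey' is left to the sub-classifier.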

Wesley Baugh answered Jan 29 '23 05:01
