I'm looking to implement a classifier with approximately 150 categories (probably in Java), mostly for tweets (so very small documents). Some of the classes have very similar domains, e.g. 'Companies', 'Competition', 'Consumers', 'International law', 'International organisations', 'International politics and government'. What algorithm or approach is best when such a high resolution is needed? I've tried Naive Bayes (obviously) and so far it hasn't performed very well (although that could just be due to the quality of the training data). The community's thoughts would be very welcome!
Thanks,
Mark
It might be worthwhile to come up with a hierarchical classifier built from (potentially many) levels of sub-classifiers (i.e., come up with a taxonomy for your document labels).
A single classifier can output any of the many possible class labels.
A hierarchical classifier groups related class labels together, and performs additional layers of classification until a leaf node is reached (or until the confidence drops below a certain threshold).
The intuition is that each sub-classifier will have an easier time learning discriminative features when it has fewer categories to tell apart.
For example, a hierarchical classifier may have an easier time learning that 'player' is a good feature indicative of sports, whereas a single classifier would have a more difficult time if 'player' was only seen for one category (basketball) and not another (hockey).
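To make the idea concrete, here is a minimal two-level sketch in Java. Everything in it is an assumption for illustration: the class names NaiveBayes and HierarchicalClassifier, the 'Sport' and 'International' groupings, and the toy training sentences are all hypothetical, and the sub-classifier is a bare-bones multinomial Naive Bayes with add-one smoothing that you would swap for whatever model works best on your data. It also omits the confidence threshold mentioned above and always descends to a leaf.

```java
import java.util.*;

/** Tiny multinomial Naive Bayes with add-one smoothing (illustrative only). */
class NaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-posterior, or null if untrained. */
    String classify(String text) {
        List<String> tokens = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            // Denominator: words seen for this label + vocabulary size,
            // plus one extra slot reserved for unseen words.
            double denom = totalWords.getOrDefault(label, 0) + vocabulary.size() + 1;
            for (String token : tokens) {
                score += Math.log((counts.getOrDefault(token, 0) + 1.0) / denom);
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}

/** Two-level hierarchy: a root model picks the group, a per-group model picks the leaf. */
class HierarchicalClassifier {
    private final NaiveBayes root = new NaiveBayes();
    private final Map<String, NaiveBayes> children = new HashMap<>();

    void train(String group, String leaf, String text) {
        root.train(group, text); // the root sees every document under its group label
        children.computeIfAbsent(group, k -> new NaiveBayes()).train(leaf, text);
    }

    /** Returns {group, leaf}; assumes at least one training example was added. */
    String[] classify(String text) {
        String group = root.classify(text);
        String leaf = children.get(group).classify(text);
        return new String[] { group, leaf };
    }
}
```

A usage example with made-up data: train with `train("Sport", "Basketball", "player scores three pointer in basketball game")` and similar calls for the other leaves, then call `classify("player slips on ice chasing puck")`. At the root, 'player' counts as evidence for 'Sport' even though it was only ever seen in a basketball example, and the 'Sport' sub-classifier then only has to choose between basketball and hockey.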