
Orange vs NLTK for Content Classification in Python [closed]

We need a content classification module. A Bayesian classifier seems to be what we are looking for. Should we go for Orange or NLTK?

philgo20 asked Jan 25 '11 02:01



2 Answers

Well, as the documentation shows, the Naive Bayes implementation in each library is easy to use, so why not run your data through both and compare the results?

Orange and NLTK are both mature, stable libraries (10+ years in development each) that originated in large universities; they share some common features, primarily machine learning algorithms. Beyond that, they are quite different in scope, purpose, and implementation.

Orange is domain agnostic: it is not directed towards a particular academic discipline or commercial domain, and instead advertises itself as a full-stack data mining and ML platform. Its focus is on the tools themselves, not on the application of those tools in a particular discipline.

Its features include IO, data analysis algorithms, and a data visualization canvas.

NLTK, on the other hand, began as and remains an academic project in a computational linguistics department of a large university. The task you mentioned (document content classification) and your algorithm of choice (Naive Bayes) are pretty much right at the core of NLTK's functionality.

NLTK does include some ML/data mining algorithms, but only because they have a particular utility in computational linguistics. They sit alongside document parsers, tokenizers, part-of-speech analyzers, and so on, all of which make up NLTK.

Even if the Naive Bayes implementation in Orange is just as good, I would still choose NLTK's implementation because it is clearly optimized for the particular task you mentioned.

There are numerous tutorials on NLTK, and in particular on using its Naive Bayes classifier for content classification. A blog post by Jim Plus and another on streamhacker.com, for instance, present excellent tutorials on NLTK's Naive Bayes; the second includes a line-by-line discussion of the code required to access this module. The authors of both posts report good results using NLTK (92% in the former, 73% in the latter).
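For a feel of how little code is involved, here is a minimal sketch of training and scoring NLTK's Naive Bayes classifier. The toy sentences, the labels, and the bag_of_words helper are made up for illustration and are not taken from either tutorial; real data would need proper tokenization and far more examples.

```python
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy


def bag_of_words(text):
    """Turn a string into presence features: {token: True}.

    Whitespace splitting is an assumption made to keep the sketch
    free of corpus downloads; a real tokenizer would do better.
    """
    return {word: True for word in text.lower().split()}


# Hypothetical toy data: (text, label) pairs.
TRAIN = [
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("the striker scored twice", "sports"),
    ("the election results were announced", "politics"),
    ("parliament passed the new law", "politics"),
    ("the senator gave a speech", "politics"),
]
TEST = [
    ("the striker scored a goal", "sports"),
    ("the senator passed a law", "politics"),
]

# NLTK classifiers train on a list of (feature dict, label) pairs.
train_set = [(bag_of_words(text), label) for text, label in TRAIN]
test_set = [(bag_of_words(text), label) for text, label in TEST]

classifier = NaiveBayesClassifier.train(train_set)

# Fraction of held-out examples labelled correctly.
held_out_accuracy = accuracy(classifier, test_set)


def classify(text):
    """Label a new piece of text with the trained classifier."""
    return classifier.classify(bag_of_words(text))
```

Calling classifier.show_most_informative_features() afterwards prints which tokens carried the most weight, which is a quick sanity check on the features.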

doug answered Sep 18 '22 16:09



I don't know Orange, but +1 for NLTK:

I've successfully used the classification tools in NLTK to classify text and related metadata. Naive Bayes is the default, but there are alternatives such as Maximum Entropy. Also, being a toolkit, it lets you customize as you see fit, e.g. by creating your own features (which is what I did for the metadata).
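As a rough illustration of that kind of customization, the sketch below mixes word features with metadata features in one feature dict. The document fields (text, author, length), the 1000-character threshold, the toy documents, and the helper names are all hypothetical, not what I actually used. The Maximum Entropy branch assumes NumPy is installed, since NLTK's built-in trainers rely on it.

```python
from nltk import NaiveBayesClassifier
from nltk.classify import MaxentClassifier


def extract_features(doc):
    """Build one feature dict from text plus metadata.

    `doc` is a hypothetical dict with 'text', 'author' and 'length' keys.
    """
    features = {"contains(%s)" % w: True for w in doc["text"].lower().split()}
    features["author"] = doc["author"]            # metadata used as-is
    features["is_long"] = doc["length"] > 1000    # arbitrary example threshold
    return features


def train_classifier(train_set, algorithm="naive_bayes"):
    """Train on (feature dict, label) pairs with the chosen algorithm."""
    if algorithm == "maxent":
        # Same input format as Naive Bayes; trace=0 silences progress output.
        return MaxentClassifier.train(train_set, trace=0, max_iter=10)
    return NaiveBayesClassifier.train(train_set)


# Hypothetical toy documents with labels.
DOCS = [
    ({"text": "new kernel release notes", "author": "alice", "length": 1800}, "tech"),
    ({"text": "compiler benchmark results", "author": "alice", "length": 2500}, "tech"),
    ({"text": "weekend recipe ideas", "author": "bob", "length": 400}, "food"),
    ({"text": "quick pasta dinner", "author": "bob", "length": 300}, "food"),
]

train_set = [(extract_features(doc), label) for doc, label in DOCS]
classifier = train_classifier(train_set)

# None of these words were seen in training, so only the metadata
# features can influence the prediction.
new_doc = {"text": "untitled draft", "author": "alice", "length": 2000}
predicted = classifier.classify(extract_features(new_doc))
```

Because both classifiers accept the same (feature dict, label) pairs, switching algorithms is a one-argument change and the feature extractor stays untouched.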

NLTK also has a couple of good books, one of which is available under a Creative Commons license (as well as from O'Reilly).

winwaed answered Sep 19 '22 16:09
