
Which classifier to choose in NLTK

I want to classify text messages into several categories, such as "relation building", "coordination", "information sharing", "knowledge sharing" and "conflict resolution". I am using the NLTK library to process the data. I would like to know which classifier in NLTK is best suited to this multi-class classification problem.

I am planning to use a Naive Bayes classifier. Is that advisable?

Maggie asked Jul 05 '11 16:07





2 Answers

Naive Bayes is the simplest classifier and the easiest to understand, which makes it a nice one to start with. Decision trees with a beam search to find the best classification are not significantly harder to understand and are usually a bit better. MaxEnt and SVM tend to be more complex, and SVM requires some tuning to get right.

Most important are the choice of features and the amount and quality of the data you provide!

With your problem, I would focus first on ensuring you have a good training/testing dataset and on choosing good features. Since you are asking this question, I assume you haven't had much experience with machine learning for NLP, so I'd say start off easy with Naive Bayes, as it doesn't need complex features: you can just tokenize and count word occurrences.
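As a rough illustration of "tokenize and count word occurrences", here is a minimal sketch. The helper names (`word_count_features`, `train_naive_bayes`) and the toy messages are made up for this example; the only NLTK calls used are `NaiveBayesClassifier.train` and, in the usage note, `classify`. Note that NLTK's Naive Bayes treats each feature value as a discrete category, so a count of 2 is simply a different value from a count of 1.

```python
import re
from collections import Counter


def word_count_features(message):
    """Tokenize a message and count word occurrences (a bag of words)."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return dict(Counter(tokens))


def train_naive_bayes(labeled_messages):
    """Train NLTK's Naive Bayes classifier on (text, label) pairs."""
    # Imported here so the feature extractor above works without NLTK installed.
    from nltk.classify import NaiveBayesClassifier

    train_set = [(word_count_features(text), label)
                 for text, label in labeled_messages]
    return NaiveBayesClassifier.train(train_set)


# Toy, made-up training data; a real corpus needs many examples per category.
TOY_MESSAGES = [
    ("Can we meet at noon to plan the release?", "coordination"),
    ("Please schedule the meeting for Monday.", "coordination"),
    ("The report is attached for your reference.", "information sharing"),
    ("Here is the link to the new report.", "information sharing"),
]
```

With NLTK installed, `train_naive_bayes(TOY_MESSAGES)` returns a classifier, and `classifier.classify(word_count_features("some new message"))` returns one of the labels seen in training. Hold back part of your labeled data for testing rather than evaluating on the training messages.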

EDIT: The question "How do you find the subject of a sentence?" and my answer to it are also worth looking at.

nflacco answered Oct 07 '22 23:10



Yes. Training a Naive Bayes classifier for each category and then assigning each message to the class whose classifier gives the highest score is a standard first approach to problems like this. There are more sophisticated single-class classifier algorithms that you could substitute for Naive Bayes if you find performance inadequate, such as a Support Vector Machine (which I believe is available in NLTK via a Weka plug-in, but I'm not positive). Unless you can think of anything specific to this problem domain that would make Naive Bayes especially unsuitable, it's often the go-to "first try" for a lot of projects.
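To make the "one classifier per category, pick the highest score" idea concrete, here is a small from-scratch sketch. It is not NLTK's implementation: the `BinaryNaiveBayes` class, the helper functions and the toy messages are all invented for illustration, and the scoring is a plain multinomial Naive Bayes with add-one smoothing.

```python
import math
from collections import Counter


class BinaryNaiveBayes:
    """Tiny multinomial Naive Bayes scoring 'target' against everything else."""

    def __init__(self, target):
        self.target = target

    def fit(self, docs):
        # docs is a list of (tokens, label) pairs.
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for tokens, label in docs:
            side = (label == self.target)
            self.counts[side].update(tokens)
            self.n_docs[side] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_prob(self, tokens, side):
        total_docs = self.n_docs[True] + self.n_docs[False]
        # Add-one smoothing on the prior and on the word likelihoods.
        logp = math.log((self.n_docs[side] + 1) / (total_docs + 2))
        denom = sum(self.counts[side].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are skipped
                logp += math.log((self.counts[side][tok] + 1) / denom)
        return logp

    def score(self, tokens):
        """Log-odds that the message belongs to the target category."""
        return self._log_prob(tokens, True) - self._log_prob(tokens, False)


def train_one_vs_rest(docs):
    """Train one binary model per category found in the data."""
    labels = sorted({label for _, label in docs})
    return {label: BinaryNaiveBayes(label).fit(docs) for label in labels}


def classify(models, tokens):
    """Return the category whose binary model gives the highest score."""
    return max(models, key=lambda label: models[label].score(tokens))


# Toy, made-up data for illustration only.
docs = [
    ("can we meet at noon to plan the release".split(), "coordination"),
    ("please schedule the meeting for monday".split(), "coordination"),
    ("the report is attached for your reference".split(), "information sharing"),
    ("here is the link to the new report".split(), "information sharing"),
    ("thanks for your help you are great".split(), "relation building"),
    ("great to see you hope you are well".split(), "relation building"),
]
models = train_one_vs_rest(docs)
print(classify(models, "please schedule a meeting".split()))
```

Because each binary model is trained independently, their scores are only roughly comparable. That is usually acceptable for a first attempt, and a single natively multi-class classifier avoids the issue entirely.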

The other NLTK classifier I would consider trying is MaxEnt, as I believe it natively handles multiclass classification (though the multiple binary classifier approach is very standard and common as well). In any case, the most important thing is to collect a very large corpus of properly tagged text messages.

If by "text messages" you mean actual cell phone text messages, these tend to be very short, and the language is very informal and varied. I think feature selection may end up being a larger factor in determining accuracy than classifier choice for you. For example, using a stemmer or lemmatizer that understands the common abbreviations and idioms used, tagging parts of speech or chunking, entity extraction, or extracting probable relationships between terms may provide more bang for the buck than using more complex classifiers.
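As a sketch of what feature work on short, informal messages might look like, the snippet below expands a few abbreviations before building features. Everything here is an assumption for illustration: the `ABBREVIATIONS` table is hypothetical (a real one would be built from your own corpus), and the feature names are arbitrary. The output is a plain dict of the kind NLTK classifiers accept as a feature set.

```python
import re

# Hypothetical abbreviation table; build a real one from your own corpus.
ABBREVIATIONS = {
    "u": "you",
    "r": "are",
    "thx": "thanks",
    "pls": "please",
    "mtg": "meeting",
}


def normalize_tokens(message):
    """Lowercase, tokenize, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]


def message_features(message):
    """Build a feature dict from normalized words, word pairs, and punctuation."""
    tokens = normalize_tokens(message)
    features = {"contains(%s)" % tok: True for tok in tokens}
    # Adjacent word pairs capture short phrases such as "are you".
    for first, second in zip(tokens, tokens[1:]):
        features["bigram(%s %s)" % (first, second)] = True
    # Questions often signal coordination or information requests.
    features["has_question_mark"] = "?" in message
    return features
```

For example, `normalize_tokens("Thx, r u free 4 the mtg?")` yields the tokens `thanks are you free 4 the meeting`, so abbreviated and spelled-out messages end up sharing features.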

This paper talks about classifying Facebook status messages by sentiment, which raises some of the same issues and may provide some insight. The link is to a Google cache because I'm having problems with the original site:

http://docs.google.com/viewer?a=v&q=cache:_AeBYp6i1ooJ:nlp.stanford.edu/courses/cs224n/2010/reports/ssoriajr-kanej.pdf+maxent+classifier+multiple+classes&hl=en&gl=us&pid=bl&srcid=ADGEESi-eZHTZCQPo7AlcnaFdUws9nSN1P6X0BVmHjtlpKYGQnj7dtyHmXLSONa9Q9ziAQjliJnR8yD1Z-0WIpOjcmYbWO2zcB6z4RzkIhYI_Dfzx2WqU4jy2Le4wrEQv0yZp_QZyHQN&sig=AHIEtbQN4J_XciVhVI60oyrPb4164u681w&pli=1

bdk answered Oct 07 '22 23:10
