Which classifier to choose in NLTK

Tags: classification, nlp, nltk

I want to classify text messages into several categories, such as "relation building", "coordination", "information sharing", "knowledge sharing" and "conflict resolution". I am using the NLTK library to process this data. I would like to know which classifier in NLTK is best suited to this particular multi-class classification problem.

I am planning to use Naive Bayes classification. Is that advisable?


asked Jul 05 '11 16:07

Maggie

2 Answers

Naive Bayes is the simplest and easiest classifier to understand, and for that reason it's nice to use. Decision trees with a beam search to find the best classification are not significantly harder to understand and are usually a bit better. MaxEnt and SVM tend to be more complex, and SVM requires some tuning to get right.

Most important is the choice of features + the amount/quality of data you provide!

For your problem, I would focus first on making sure you have a good training/testing dataset and on choosing good features. Since you are asking this question, you probably haven't had much experience with machine learning for NLP, so I'd say start off easy with Naive Bayes, as it doesn't need complex features: you can just tokenize and count word occurrences.

EDIT: The question "How do you find the subject of a sentence?" and my answer to it are also worth looking at.
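To make the "tokenize and count" idea concrete, here is a minimal sketch using NLTK's `NaiveBayesClassifier`. The example messages and the `bag_of_words` helper are made up for illustration, and the tokenizer is a plain regex rather than anything NLTK-specific. It marks each word as present instead of storing raw counts, since this classifier treats feature values as categorical. Note that it accepts any number of labels in one model.

```python
import re
from nltk.classify import NaiveBayesClassifier

# Hypothetical hand-labeled messages; a real training set needs far more data.
LABELED_MESSAGES = [
    ("thanks so much for your help today", "relation building"),
    ("thanks it was great catching up", "relation building"),
    ("meeting moved to 3pm tomorrow", "coordination"),
    ("can we schedule the meeting for friday", "coordination"),
    ("fyi the server is down until noon", "information sharing"),
    ("fyi the release notes are attached", "information sharing"),
    ("here is how to reset your password", "knowledge sharing"),
    ("here is how the build script works", "knowledge sharing"),
    ("sorry about the misunderstanding earlier", "conflict resolution"),
    ("sorry let us talk and sort this out", "conflict resolution"),
]


def bag_of_words(message):
    """Lowercase the message, split it into word tokens, mark each as present."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return {token: True for token in tokens}


# NLTK expects a list of (feature_dict, label) pairs.
train_set = [(bag_of_words(text), label) for text, label in LABELED_MESSAGES]
classifier = NaiveBayesClassifier.train(train_set)

predicted = classifier.classify(bag_of_words("meeting tomorrow"))
print(predicted)
```

Once this works end to end, `classifier.show_most_informative_features()` is a quick way to see which words the model leans on, which helps when deciding what features to try next.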


answered Oct 07 '22 23:10

nflacco

Yes. Training a Naive Bayes classifier for each category and then assigning each message to the class whose classifier gives the highest score is a standard first approach to problems like this. There are more sophisticated per-class (binary) classifier algorithms that you could substitute for Naive Bayes if you find performance inadequate, such as a Support Vector Machine (which I believe is available in NLTK via a Weka plug-in, though I'm not positive). Unless you can think of anything specific to this problem domain that would make Naive Bayes especially unsuitable, it's often the go-to "first try" for a lot of projects.
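As a rough sketch of the one-classifier-per-category idea, the code below trains a binary Naive Bayes model for each label (this category vs. everything else) and picks the category whose model gives the highest probability. The training messages and the helper names (`features`, `train_one_vs_rest`, `classify_one_vs_rest`) are invented for the example; only `NaiveBayesClassifier.train` and `prob_classify` come from NLTK.

```python
import re
from nltk.classify import NaiveBayesClassifier

# Hypothetical labeled messages, for illustration only.
LABELED_MESSAGES = [
    ("thanks so much for your help today", "relation building"),
    ("thanks it was great catching up", "relation building"),
    ("meeting moved to 3pm tomorrow", "coordination"),
    ("can we schedule the meeting for friday", "coordination"),
    ("fyi the server is down until noon", "information sharing"),
    ("fyi the release notes are attached", "information sharing"),
    ("here is how to reset your password", "knowledge sharing"),
    ("here is how the build script works", "knowledge sharing"),
    ("sorry about the misunderstanding earlier", "conflict resolution"),
    ("sorry let us talk and sort this out", "conflict resolution"),
]


def features(message):
    """Word-presence features from a simple regex tokenizer."""
    return {tok: True for tok in re.findall(r"[a-z0-9']+", message.lower())}


def train_one_vs_rest(labeled):
    """Train one binary classifier per category (label True = this category)."""
    categories = sorted({label for _, label in labeled})
    models = {}
    for category in categories:
        binary_set = [(features(text), label == category) for text, label in labeled]
        models[category] = NaiveBayesClassifier.train(binary_set)
    return models


def classify_one_vs_rest(models, message):
    """Score the message with every binary model and return the best category."""
    feats = features(message)
    scores = {cat: model.prob_classify(feats).prob(True) for cat, model in models.items()}
    return max(scores, key=scores.get), scores


models = train_one_vs_rest(LABELED_MESSAGES)
best, scores = classify_one_vs_rest(models, "sorry about earlier")
print(best)
```

Keeping the per-category scores around is handy: if the top two scores are close, that is a message worth checking by hand or adding to the training set.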

The other NLTK classifier I would consider trying is MaxEnt, as I believe it natively handles multiclass classification (though the multiple-binary-classifier approach is very standard and common as well). In any case, the most important thing is to collect a very large corpus of properly tagged text messages.
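For comparison, a MaxEnt model in NLTK can be trained on the same kind of `(features, label)` pairs with all labels in one model. This is only a sketch: the messages are made up, the `features` helper is hypothetical, and the choice of the built-in `"gis"` algorithm with `max_iter=20` is an arbitrary setting for a toy dataset, not a recommendation.

```python
import re
from nltk.classify import MaxentClassifier

# Hypothetical labeled messages, for illustration only.
LABELED_MESSAGES = [
    ("thanks so much for your help today", "relation building"),
    ("thanks it was great catching up", "relation building"),
    ("meeting moved to 3pm tomorrow", "coordination"),
    ("can we schedule the meeting for friday", "coordination"),
    ("fyi the server is down until noon", "information sharing"),
    ("fyi the release notes are attached", "information sharing"),
    ("here is how to reset your password", "knowledge sharing"),
    ("here is how the build script works", "knowledge sharing"),
    ("sorry about the misunderstanding earlier", "conflict resolution"),
    ("sorry let us talk and sort this out", "conflict resolution"),
]


def features(message):
    """Word-presence features from a simple regex tokenizer."""
    return {tok: True for tok in re.findall(r"[a-z0-9']+", message.lower())}


train_set = [(features(text), label) for text, label in LABELED_MESSAGES]

# trace=0 silences the per-iteration log; max_iter caps the training loop.
maxent = MaxentClassifier.train(train_set, algorithm="gis", trace=0, max_iter=20)

# prob_classify returns a distribution over all labels, not just the winner.
dist = maxent.prob_classify(features("can we move the meeting"))
ranked = sorted(dist.samples(), key=dist.prob, reverse=True)
for label in ranked:
    print(label, round(dist.prob(label), 3))
```

Because both classifiers share the same input format, you can swap one for the other and compare them on a held-out test set with `nltk.classify.accuracy`.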

If by "text messages" you mean actual cell phone text messages, these tend to be very short and the language is very informal and varied, so I think feature selection may end up being a larger factor in determining accuracy than classifier choice. For example, using a stemmer or lemmatizer that understands the common abbreviations and idioms, part-of-speech tagging or chunking, entity extraction, and extracting probable relationships between terms may provide more bang for the buck than using more complex classifiers.

This paper talks about classifying Facebook status messages based on sentiment, which has some of the same issues and may provide some insights. The link is to a Google cache because I'm having problems with the original site:
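Here is one way the abbreviation-aware stemming idea might look in practice. It is a sketch under assumptions: the `ABBREVIATIONS` table is a tiny invented lookup (a real one would have to be built from your own messages), and the two extra features (`has_question_mark`, `is_short`) are just examples of cheap signals you could add. `PorterStemmer` is NLTK's stock stemmer and does not know about texting slang on its own, hence the lookup step before it.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical abbreviation table; extend it from your own corpus.
ABBREVIATIONS = {
    "u": "you",
    "thx": "thanks",
    "pls": "please",
    "mtg": "meeting",
    "tmrw": "tomorrow",
}

_stemmer = PorterStemmer()


def extract_features(message):
    """Expand known abbreviations, stem each token, and add a few surface cues."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    expanded = [ABBREVIATIONS.get(tok, tok) for tok in tokens]

    feats = {}
    for tok in expanded:
        # Stemming folds variants such as "meeting" and "meetings" together.
        feats["stem(%s)" % _stemmer.stem(tok)] = True

    # Simple surface features that do not depend on vocabulary.
    feats["has_question_mark"] = "?" in message
    feats["is_short"] = len(tokens) <= 3
    return feats


print(extract_features("thx 4 the mtg"))
```

The returned dictionary can be passed straight to any NLTK classifier in place of a plain bag of words, so you can measure whether the extra normalization actually helps on your data.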

http://docs.google.com/viewer?a=v&q=cache:_AeBYp6i1ooJ:nlp.stanford.edu/courses/cs224n/2010/reports/ssoriajr-kanej.pdf+maxent+classifier+multiple+classes&hl=en&gl=us&pid=bl&srcid=ADGEESi-eZHTZCQPo7AlcnaFdUws9nSN1P6X0BVmHjtlpKYGQnj7dtyHmXLSONa9Q9ziAQjliJnR8yD1Z-0WIpOjcmYbWO2zcB6z4RzkIhYI_Dfzx2WqU4jy2Le4wrEQv0yZp_QZyHQN&sig=AHIEtbQN4J_XciVhVI60oyrPb4164u681w&pli=1

answered Oct 07 '22 23:10

bdk
