
Text classification in Python (NLTK, sentence-based)

I need to classify text, and I am using the TextBlob Python module to do it. I can use either a Naive Bayes classifier or a decision tree. I am concerned about the points below.

1) I need to classify sentences as argument / not an argument. I am using two classifiers and training the models on appropriate data sets. My question: do I need to train the model with keywords only, or can I train it on full example sentences, both arguments and non-arguments? Which approach is best in terms of classification accuracy and retrieval time?

2) Since the classification is binary (argument / not an argument), which classifier would give the most accurate results: Naive Bayes, decision tree, or positive Naive Bayes?

Thanks in advance.

sreram asked Apr 20 '14 04:04





1 Answer

In principle, the more data you train on, the better your results, but you only know that once you have tested the model against a held-out set of sentences you have labeled yourself.

So, to answer your question: training the model on keywords alone may match too broadly and flag sentences that are not arguments. You need a baseline to compare against, so I suggest also training the model on features that capture the sentence structure arguments tend to follow (a pattern of some sort); that may filter out the non-arguments. Then test again to see whether accuracy improves over the keyword-only model.
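To make the keyword-versus-sentence choice concrete, here is a minimal sketch of two feature extractors. The cue-word list and the structural cues (question mark, sentence length) are my own illustrative assumptions, not something derived from your data. Each function returns a plain dict, which is the feature-set shape NLTK classifiers train on; TextBlob classifiers accept a function like this through their `feature_extractor` argument.

```python
import re

# Hypothetical cue words; a real list would come from your own labeled data.
ARGUMENT_KEYWORDS = {"because", "therefore", "should", "since", "thus"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def keyword_features(sentence):
    """Keyword-only features: one boolean per cue word."""
    tokens = set(tokenize(sentence))
    return {f"has({kw})": kw in tokens for kw in sorted(ARGUMENT_KEYWORDS)}


def sentence_features(sentence):
    """Whole-sentence features: every word plus a few structural cues."""
    tokens = tokenize(sentence)
    features = {f"contains({tok})": True for tok in tokens}
    features["has_cue_word"] = any(tok in ARGUMENT_KEYWORDS for tok in tokens)
    features["is_question"] = sentence.strip().endswith("?")
    features["long_sentence"] = len(tokens) > 12  # arbitrary threshold
    return features
```

Train one model with each extractor on the same sentences and compare them on the same test set; that tells you whether the extra features actually pay off.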

To answer your next question (which approach is best in terms of classification accuracy and retrieval time?): it really depends on the data you're using. I can't answer it in general, because you have to run cross-validation to see whether your model achieves high accuracy. As a rule, the more features you extract, the slower your learning algorithm runs. And if you have gigabytes of text to analyze, I suggest using MapReduce for the job.
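As a rough illustration of the cross-validation step, here is a sketch of k-fold evaluation in plain Python. The names `k_fold_accuracy` and `train_majority` are made up for this example. It assumes a training function that takes a list of (sentence, label) pairs and returns a `predict(sentence)` function, and that you have at least `k` labeled sentences.

```python
import random


def k_fold_accuracy(labeled, train_fn, k=5, seed=0):
    """Return the mean accuracy of train_fn over k folds.

    labeled  -- list of (sentence, label) pairs, at least k of them
    train_fn -- takes a training list, returns a predict(sentence) function
    """
    data = list(labeled)
    random.Random(seed).shuffle(data)  # fixed seed keeps runs repeatable
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [pair for j, fold in enumerate(folds) if j != i
                    for pair in fold]
        predict = train_fn(training)
        correct = sum(predict(text) == label for text, label in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def train_majority(training):
    """Baseline: always predict the most common training label."""
    labels = [label for _, label in training]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda text: majority
```

The majority-label baseline is only there to give you a floor: a real classifier that cannot beat it under the same folds is not learning anything useful.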

You might also want to try SVMs as your learning model: test one against the other models (Naive Bayes, positive Naive Bayes, and decision trees) and see which performs best.
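A side-by-side comparison could look roughly like the sketch below. To keep it dependency-free it compares a small hand-written multinomial Naive Bayes against a keyword rule rather than a real SVM; an SVM from a library such as scikit-learn, or a TextBlob classifier, could be wrapped as another entry in `models`. The sentences, labels and cue words are invented toy data for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy data; replace with your own labeled sentences.
TRAIN = [
    ("We should go because it is late", "arg"),
    ("Taxes must fall since growth is slow", "arg"),
    ("Therefore the plan will fail", "arg"),
    ("The cat sat on the mat", "not"),
    ("It is raining today", "not"),
    ("She bought a red car", "not"),
]
TEST = [
    ("We should stay because it is raining", "arg"),
    ("The car is red", "not"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(training):
    """Multinomial Naive Bayes with add-one smoothing; returns predict()."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in training:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}

    def predict(text):
        best_label, best_score = None, -math.inf
        for label in sorted(label_counts):
            total = sum(word_counts[label].values())
            score = math.log(label_counts[label] / len(training))
            for tok in tokenize(text):
                if tok not in vocab:
                    continue  # skip words never seen in training
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


def train_keyword_rule(training, cues=("because", "therefore", "should")):
    """Untrained baseline: 'arg' if any hypothetical cue word appears."""
    def predict(text):
        return "arg" if set(tokenize(text)) & set(cues) else "not"
    return predict


def compare(models, training, test):
    """Train each model on the same data; return accuracy per model name."""
    results = {}
    for name, train_fn in models.items():
        predict = train_fn(training)
        results[name] = sum(predict(t) == y for t, y in test) / len(test)
    return results
```

Call it as `compare({"naive_bayes": train_naive_bayes, "keyword_rule": train_keyword_rule}, TRAIN, TEST)`. With a data set this small the numbers mean nothing; the point is that every candidate is trained and scored on exactly the same split.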

Hope this helps.

macmania314 answered Sep 23 '22 04:09
