 

Tweet Classifier Feature-Selection NLTK

I'm currently trying to classify tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the '$' prefix (e.g. $AAPL). I've been basing my Python script off of this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results. However, I feel there is much, much room for improvement.
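
For context, the core of my setup is essentially the word-presence approach from that post. A minimal sketch (the tweets and labels below are placeholders for my real data):

```python
import nltk

# Hypothetical placeholder data; my real corpus is tweets pulled for each symbol.
tweets = [
    ("$AAPL is going to crush earnings", "positive"),
    ("selling all my $AAPL, this is bad", "negative"),
    ("$AAPL closed flat today", "neutral"),
]

def word_features(text):
    # Every word in the tweet becomes a boolean "contains(word)" feature.
    return {"contains({})".format(word.lower()): True for word in text.split()}

training_set = [(word_features(text), label) for text, label in tweets]
classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.show_most_informative_features(10)
```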

In my word-feature selection method, I decided to implement the tf-idf algorithm to select the most informative words. After having done this, though, I felt that the results weren't that impressive.
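
Roughly, the scoring I used looks like this sketch (the exact tf and idf variants may differ slightly from my script):

```python
import math
from collections import Counter

def tf_idf_scores(tokenized_tweets):
    """tokenized_tweets: list of token lists. Returns {word: highest tf-idf seen}."""
    n_docs = len(tokenized_tweets)
    # Document frequency: number of tweets containing the word at least once.
    doc_freq = Counter(word for doc in tokenized_tweets for word in set(doc))
    scores = {}
    for doc in tokenized_tweets:
        counts = Counter(doc)
        for word, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = max(scores.get(word, 0.0), tf * idf)
    return scores

# Keep only the highest-scoring words as candidate features, e.g.:
# scores = tf_idf_scores(tokenized_tweets)
# best_words = sorted(scores, key=scores.get, reverse=True)[:1000]
```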

I then implemented the technique from the following blog post: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the tf-idf algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:
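
As I understand it, that post scores each word by its chi-squared association with the class labels and keeps only the top scorers. A rough sketch of the idea (the data here is a placeholder, and the 1000-word cutoff is arbitrary):

```python
from nltk import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures

# Placeholder (token list, label) pairs standing in for the real corpus.
labelled_tweets = [
    ("$AAPL will beat expectations".split(), "positive"),
    ("dumping my $AAPL shares".split(), "negative"),
]

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for tokens, label in labelled_tweets:
    for word in tokens:
        word_fd[word.lower()] += 1
        label_word_fd[label][word.lower()] += 1

total_count = word_fd.N()
word_scores = {}
for word, freq in word_fd.items():
    # Sum the word's chi-squared association with each label.
    word_scores[word] = sum(
        BigramAssocMeasures.chi_sq(label_word_fd[label][word],
                                   (freq, label_word_fd[label].N()),
                                   total_count)
        for label in label_word_fd.conditions()
    )

# Keep only the top-scoring words as features.
best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:1000])
```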

Tweets and real language don't use the same grammar and wording. In normal text, many articles and verbs can be singled out using tf-idf or stopwords. However, in a tweet corpus, some extremely uninformative words, such as 'the', 'and', 'is', etc., occur just as often as words that are crucial to categorizing text correctly. I can't just remove all words that have fewer than 3 letters, because some uninformative features are longer than that, and some informative ones are shorter.

If I could, I would like to not have to use stopwords, because of the need to frequently update the list. However, if that's my only option, I guess I'll have to go with it.

So, to summarize my question, does anyone know how to truly get the most informative words in the specific source that is a Tweet?

EDIT: I'm trying to classify into three groups: positive, negative, and neutral. Also, I was wondering, for TF-IDF, should I only be cutting off the words with the low scores, or also some with the higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?

asked Nov 05 '22 by elliottbolzan

1 Answer

The blog post you link to describes the show_most_informative_features method, but the NaiveBayesClassifier also has a most_informative_features method that returns the features rather than just printing them. You could simply set a cutoff based on your training set: features like "the", "and", and other unimportant words would be at the bottom of the list in terms of informativeness.
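
For example (a sketch only; `training_set` stands for your existing list of (feature dict, label) pairs, and the 1000 cutoff is arbitrary):

```python
import nltk

# training_set: your existing list of (feature dict, label) pairs.
classifier = nltk.NaiveBayesClassifier.train(training_set)

# most_informative_features returns (feature name, feature value) pairs,
# ordered from most to least informative.
best = set(name for name, value in classifier.most_informative_features(1000))

def trim(features):
    # Drop any feature that didn't make the informativeness cutoff.
    return {name: value for name, value in features.items() if name in best}

trimmed_training_set = [(trim(features), label) for features, label in training_set]
```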

It's true that this approach could be subject to overfitting (some features would be much more important in your training set than in your test set), but that would be true of anything that filters features based on your training set.

answered Nov 10 '22 by David Robinson