 

Choosing Features to Identify Twitter Questions as "Useful"

I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet that contains text starting with a question word (who, what, when, where, etc.) and ending with a question mark.
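
For reference, the extraction is roughly along these lines (a simplified sketch, not my exact pattern; the list of question words and the regex details are illustrative only):

  import re

  # Simplified version of the pattern described above.
  QUESTION_RE = re.compile(r"\b(who|what|when|where|why|how)\b[^?]*\?", re.IGNORECASE)

  def looks_like_question(tweet):
      """True if the tweet contains a question-word ... '?' span."""
      return QUESTION_RE.search(tweet) is not None

  print(looks_like_question("How much does a polar bear weigh?"))  # True
  print(looks_like_question("Nice game last night"))               # False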

As such, I end up with several non-useful questions in my database, like 'who cares?' and 'what's this?', and some useful ones, like 'How often is there a basketball fight?' and 'How much does a polar bear weigh?'

However, I am only interested in useful questions.

I have about 3000 questions that I have manually labeled: ~2000 of them are not useful and ~1000 are useful. I am attempting to use a naive Bayes classifier (the one that comes with NLTK) to classify the questions automatically so that I don't have to pick out the useful ones by hand.

As a start, I tried using the first three words of a question as the feature, but this doesn't help very much: out of 100 questions, the classifier correctly identified only around 10%-15% of the useful ones, and it also failed to separate the useful questions from the ones it predicted as not useful.
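
For context, the first-three-words feature extractor looks roughly like this (a simplified sketch; labeled_questions stands in for my manually labeled data):

  import nltk  # needs the 'punkt' tokenizer data: nltk.download('punkt')

  def first_three_words(question):
      # Use the first three tokens of the question as features.
      tokens = nltk.word_tokenize(question.lower())
      return {"word_%d" % i: tok for i, tok in enumerate(tokens[:3])}

  # Placeholder for the ~3000 manually labeled (question, label) pairs.
  labeled_questions = [
      ("How much does a polar bear weigh?", "useful"),
      ("who cares?", "not_useful"),
  ]
  train_set = [(first_three_words(q), label) for q, label in labeled_questions]
  classifier = nltk.NaiveBayesClassifier.train(train_set)
  print(classifier.classify(first_three_words("How often is there a basketball fight?")))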

I have tried other features, such as including all the words of the question and including the question's length, but the results did not change significantly.

Any suggestions on how I should choose the features or carry on?

Thanks.

asked Jan 14 '13 by bili


2 Answers

Some random suggestions.

Add a pre-processing step that removes stop words such as this, a, of, and, etc. Take the sentence:

  How often is there a basketball fight

After removing the stop words, you get:

  how often basketball fight 
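
A minimal sketch of that step, using a tiny hand-picked stop-word list (NLTK also ships a fuller list in nltk.corpus.stopwords, which would additionally drop words like "how"):

  from nltk.tokenize import word_tokenize  # needs the 'punkt' tokenizer data

  # Tiny illustrative stop-word list; extend it or use nltk.corpus.stopwords.
  STOP_WORDS = {"this", "a", "an", "of", "and", "is", "are", "there", "the"}

  def remove_stop_words(sentence):
      tokens = word_tokenize(sentence.lower())
      return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

  print(remove_stop_words("How often is there a basketball fight"))
  # -> "how often basketball fight"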

Calculate the tf-idf score for each word. (Treat each tweet as a document; to compute the score you need the whole corpus in order to get the document frequencies.)

For a sentence like the one above, you calculate the tf-idf score of each word:

  tf-idf(how)
  tf-idf(often)
  tf-idf(basketball)
  tf-idf(fight)

This might be useful.

Then try the additional features below for your classifier; a sketch for computing them follows the list.

  • average tf-idf score
  • median tf-idf score
  • max tf-idf score
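
For example, computed by hand with each tweet treated as one document (function and variable names are illustrative):

  import math
  from collections import Counter

  def tfidf_features(tweets):
      """tweets: list of pre-processed strings, one per tweet (= one document)."""
      docs = [t.split() for t in tweets]
      n_docs = len(docs)
      # Document frequency: in how many tweets does each word appear?
      df = Counter(word for doc in docs for word in set(doc))

      features = []
      for doc in docs:
          tf = Counter(doc)
          scores = sorted((tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf)
          features.append({
              "avg_tfidf": sum(scores) / len(scores),
              "median_tfidf": scores[len(scores) // 2],
              "max_tfidf": scores[-1],
          })
      return features

  print(tfidf_features(["how often basketball fight", "who cares"]))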

Furthermore, try a POS tagger and generate a tagged sentence for each tweet.

>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]

Then you have additional possible features to try that are related to the POS tags.
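
For instance, a few possible features derived from the tag sequence (a sketch; these particular features are just examples):

  import nltk  # needs 'punkt' and 'averaged_perceptron_tagger' data

  def pos_features(tweet):
      tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(tweet))]
      return {
          "starts_with_wh": bool(tags) and tags[0].startswith("W"),  # WRB, WP, WDT, ...
          "num_nouns": sum(1 for t in tags if t.startswith("NN")),
          "num_verbs": sum(1 for t in tags if t.startswith("VB")),
      }

  print(pos_features("How often is there a basketball fight"))
  # -> {'starts_with_wh': True, 'num_nouns': 2, 'num_verbs': 1}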

Some other features might also be useful; see the qtweet paper (a paper on tweet question identification) for details. A rough sketch of a few of them appears after the list.

  • whether the tweet contains a URL
  • whether the tweet contains an email address or phone number
  • whether a strong-feeling marker such as "!" follows the question
  • whether particular unigram words are present in the context of the tweet
  • whether the tweet mentions another user's name
  • whether the tweet is a retweet
  • whether the tweet contains a hashtag (#)
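
For example, a few of them as simple binary features (the regular expressions are deliberately simplistic):

  import re

  def tweet_features(tweet):
      return {
          "has_url": bool(re.search(r"https?://\S+", tweet)),
          "has_mention": bool(re.search(r"@\w+", tweet)),
          "has_hashtag": bool(re.search(r"#\w+", tweet)),
          "is_retweet": tweet.strip().lower().startswith("rt "),
          "has_exclamation": "!" in tweet,
      }

  print(tweet_features("RT @bob: How much does a polar bear weigh?! #trivia"))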

FYI, the authors of qtweet tried four different classifiers, namely Random Forest, SVM, J48, and logistic regression; Random Forest performed best among them.

Hope these help.

answered by greeness


A feature that would most likely be very powerful, if you can build it (not sure whether that's possible), is whether there is a reply to the tweet in question.

answered by Steve