I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet containing text that starts with a question word (who, what, when, where, etc.) and ends with a question mark.
As a result, I end up with many non-useful questions in my database, like 'who cares?' and 'what's this?', and some useful ones, like 'How often is there a basketball fight?' and 'How much does a polar bear weigh?'.
However, I am only interested in useful questions.
I have about 3000 questions that I have labeled manually: ~2000 of them are not useful and ~1000 are useful. I am attempting to use the naive Bayes classifier that comes with NLTK to classify questions automatically, so that I don't have to pick out the useful questions by hand.
As a start, I tried using the first three words of a question as a feature, but this doesn't help very much. Out of 100 questions, the classifier was correct on only around 10%-15% of the useful questions. It also failed to pick out the useful questions among the ones it predicted as not useful.
I have tried other features, such as including all the words or the length of the question, but the results did not change significantly.
Any suggestions on how I should choose the features or carry on?
Thanks.
Some random suggestions.
Remove stop words such as this, a, of, and, etc. For example, take the question:
How often is there a basketball fight
After removing the stop words, you get:
how often basketball fight
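A minimal sketch of that stop-word step in plain Python. The `STOP_WORDS` set here is a tiny hand-picked list for illustration only (an assumption, not NLTK's list), and `remove_stop_words` is a made-up helper name; in practice you would probably use a fuller list.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "there", "this"}

def remove_stop_words(question):
    """Lowercase the question, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", question.lower())
    return [word for word in words if word not in STOP_WORDS]

filtered = remove_stop_words("How often is there a basketball fight?")
print(" ".join(filtered))
```

Note that question words like "how" are deliberately left out of the stop list here, since they may carry signal for this task.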
For a sentence like the one above, calculate the tf-idf score for each word and use the scores as features:
tf-idf(how)
tf-idf(often)
tf-idf(basketball)
tf-idf(fight)
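Roughly, the scores could be computed like this. It is only a sketch: it assumes the plain variant tf = count / document length and idf = log(N / document frequency) (libraries often use smoothed versions), and `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Questions after stop-word removal (sample data for illustration).
docs = [
    ["how", "often", "basketball", "fight"],
    ["how", "much", "polar", "bear", "weigh"],
    ["who", "cares"],
]
scores = tf_idf(docs)
for word, score in scores[0].items():
    print(f"tf-idf({word}) = {score:.3f}")
```

Words that occur in many questions (like "how" here) end up with lower scores than rarer words such as "basketball".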
Part-of-speech (POS) tags might be useful:
>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
Then you have additional features to try that are related to the POS tags.
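As one possible way to turn the tags into features, here is a sketch that builds a feature dict (the kind of dict NLTK classifiers take as a featureset) from an already tagged question. The helper `pos_features` and the feature names `first_tag` and `count(...)` are my own choices for illustration, not something prescribed by NLTK.

```python
from collections import Counter

def pos_features(tagged):
    """Build a feature dict from a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged]
    # Tag of the first token, e.g. WRB for "How".
    features = {"first_tag": tags[0] if tags else "NONE"}
    # How many times each tag occurs in the question.
    for tag, count in Counter(tags).items():
        features[f"count({tag})"] = count
    return features

# Output of nltk.pos_tag for the example question.
tagged = [('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'),
          ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
features = pos_features(tagged)
print(features)
```

Whether these particular features help is something you would have to check on your labeled data.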
Another feature to try: whether "!" follows the question, or "#".
FYI, the author of qtweet attempted 4 different classifiers, namely, Random Forest, SVM, J48 and Logistic regression. Random forest performed best among them.
Hope they help.
A feature that is most likely very powerful, if you can build it (not sure if it's possible), is whether there is a reply to the tweet in question.