I'm experimenting with deriving sentiment from Twitter using Stanford's CoreNLP library, à la https://www.openshift.com/blogs/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java - see that post for the code I'm implementing.
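For reference, here is a rough Python sketch of the classification step I'm doing (my actual code is the Java from the linked post; this sketch assumes a locally running CoreNLP server and the `requests` library, and the helper name `corenlp_sentiment` is just for illustration):

```python
import json
import requests

# Assumes a CoreNLP server is running locally, e.g. started with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
def corenlp_sentiment(text, url="http://localhost:9000"):
    """Return (label, class index) per sentence from the CoreNLP sentiment model."""
    props = {"annotators": "tokenize,ssplit,parse,sentiment",
             "outputFormat": "json"}
    resp = requests.post(url,
                         params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    resp.raise_for_status()
    # Each sentence carries a label ("Very negative" ... "Very positive")
    # and a 0-4 class index in "sentimentValue".
    return [(s["sentiment"], s["sentimentValue"])
            for s in resp.json()["sentences"]]

print(corenlp_sentiment("I love the new features, but the app keeps crashing."))
```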
I am getting results, but I've noticed that there appears to be a bias towards 'negative' labels, both in my target dataset and in another dataset I use that has ground-truth labels - the Sanders Analytics Twitter Sentiment Corpus http://www.sananalytics.com/lab/twitter-sentiment/ - even though the ground-truth data do not have this bias.
I'm posting this question on the off chance that someone else has experienced this and/or may know whether it's the result of something I've done or of some bug in the CoreNLP code.
Edit (sorry it took me so long to respond): I am posting links to plots showing what I mean. I don't have enough reputation to post the images, and can only include two links in this post, so I'll add the links in the comments.
I'd like to suggest this is simply a domain mismatch. The Stanford RNTN is trained on movie review snippets and you are testing on Twitter data. Beyond the topic mismatch, tweets also tend to be ungrammatical and use abbreviated ("creative") language. If I had to suggest a more concrete reason, I would start with a lexical mismatch. Perhaps negative emotions are expressed in a domain-independent way, e.g. with common adjectives, while positive emotions are more domain-dependent or more subtle.
It's still interesting that you're getting a negative bias. The Pollyanna hypothesis suggests a positive bias, IMHO.
Going beyond your original question, there are several approaches for doing sentiment analysis specifically on microblogging data. See e.g. "The Good, the Bad and the OMG!" by Kouloumpis et al.
Michael Haas correctly points out that there is a domain mismatch, which Richard Socher also notes in the comments section.

Sentences with many unknown words and imperfect punctuation tend to get flagged as negative.
If you are using Python, VADER is a great tool for Twitter sentiment analysis. It is a rule-based tool with only ~300 lines of code and a custom-made lexicon for Twitter of ~8,000 entries, including slang and emoticons.

It is easy to modify both the rules and the lexicon, with no need for retraining. It is fully free and open source.
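For instance, a minimal sketch using the `vaderSentiment` package (installable via `pip install vaderSentiment`; the example tweets are made up):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# VADER handles slang, emoticons, capitalization, and punctuation emphasis,
# so tweet-style text does not need to be cleaned up first.
for tweet in ["OMG the new update is sooo good!! :)",
              "ugh, battery life is terrible :("]:
    scores = analyzer.polarity_scores(tweet)
    # 'neg'/'neu'/'pos' are the proportions of the text in each category;
    # 'compound' is a normalized overall score in [-1, 1].
    print(tweet, "->", scores)
```

A common convention from the VADER documentation is to bin the `compound` score with thresholds of ±0.05 to obtain positive/neutral/negative labels.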