There are ten main sources from which data for sentiment analysis can be gathered. They include news, public information, social media, customer reviews, customer service call center data, employee interaction data, and electronic health records, among others. Let's review them in detail.
Stanford Sentiment Treebank
The first sentiment analysis dataset we would like to share is the Stanford Sentiment Treebank. It contains user sentiment from Rotten Tomatoes, a movie review website.
http://www.cs.cornell.edu/home/llee/data/
http://mpqa.cs.pitt.edu/corpora/mpqa_corpus
You can use Twitter, treating the smileys in tweets as noisy sentiment labels, as described here: http://web.archive.org/web/20111119181304/http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf
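A minimal sketch of the smiley idea, assuming tweets are already available as plain strings. The emoticon lists and the `label_by_emoticon` helper are illustrative assumptions, not taken from the linked paper; tweets with no emoticon, or with both kinds, are skipped rather than guessed.

```python
import re

# Illustrative emoticon sets (an assumption); real projects use longer lists.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def _pattern(emoticons):
    # Build one alternation that matches any emoticon in the set literally.
    return re.compile("|".join(re.escape(e) for e in emoticons))


POS_RE = _pattern(POSITIVE)
NEG_RE = _pattern(NEGATIVE)


def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if there is no clear signal."""
    has_pos = bool(POS_RE.search(tweet))
    has_neg = bool(NEG_RE.search(tweet))
    if has_pos == has_neg:
        # Either no emoticon at all or mixed signals: leave the tweet unlabeled.
        return None
    label = "positive" if has_pos else "negative"
    # Strip the emoticons so a classifier cannot simply memorize them.
    text = NEG_RE.sub("", POS_RE.sub("", tweet))
    return " ".join(text.split()), label


if __name__ == "__main__":
    for tweet in ["Loved the new phone :)", "Battery died again :-(", "meh"]:
        print(tweet, "->", label_by_emoticon(tweet))
```

Labels obtained this way are noisy, so it is worth spot-checking a sample by hand before training on them.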
Hope that gets you started. There's more in the literature, if you're interested in specific subtasks like negation, sentiment scope, etc.
To focus on companies, you might pair a sentiment method with topic detection or, more cheaply, just collect a lot of mentions of a given company. Alternatively, you could get your data annotated by Mechanical Turkers.
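The cheap variant, collecting mentions of a given company, could look roughly like this. It is a sketch only: the company name "Acme", its aliases, and the `mentions_filter` helper are hypothetical, and whole-word matching is a simplification that proper topic detection would improve on.

```python
import re


def mentions_filter(texts, aliases):
    """Keep texts that mention any alias as a whole word (case-insensitive)."""
    pattern = re.compile(
        # Lookarounds prevent matches inside longer words such as "Acmeist".
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in aliases) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [t for t in texts if pattern.search(t)]


if __name__ == "__main__":
    sample = [
        "Acme shares fell today",
        "I bought an acme anvil",
        "Acmeist poetry is underrated",
        "Nothing to see here",
    ]
    # "Acme" and "$ACME" are made-up aliases for a made-up company.
    for text in mentions_filter(sample, ["Acme", "$ACME"]):
        print(text)
```

The filtered texts can then be passed to whatever sentiment method you choose, or sent out for annotation.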
This is a list I wrote a few weeks ago, from my blog. Some of these datasets have been recently included in the NLTK Python platform.
Opinion Lexicon by Bing Liu
MPQA Subjectivity Lexicon
SentiWordNet
Harvard General Inquirer
Linguistic Inquiry and Word Counts (LIWC)
Vader Lexicon
MPQA Datasets
NOTES: GNU Public License.
Sentiment140 (Tweets)
STS-Gold (Tweets)
Customer Review Dataset (Product reviews)
Included in the NLTK Python platform
Pros and Cons Dataset (Pros and cons sentences)
Sentences tagged <pros> or <cons>
Included in the NLTK Python platform
Comparative Sentences (Reviews)
Included in the NLTK Python platform
Sanders Analytics Twitter Sentiment Corpus (Tweets)
5513 hand-classified tweets with respect to 4 different topics. Because of Twitter’s ToS, a small Python script is included to download all of the tweets. The sentiment classifications themselves are provided free of charge and without restrictions. They may be used for commercial products, redistributed, and modified.
Spanish tweets (Tweets)
SemEval 2014 (Tweets)
You MUST NOT re-distribute the tweets, the annotations or the corpus obtained (from the readme file)
Various Datasets (Reviews)
Various Datasets #2 (Reviews)
Here are a few more:
http://inclass.kaggle.com/c/si650winter11
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
If you have some resources (media channels, blogs, etc.) about the domain you want to explore, you can create your own corpus. I do this in Python:
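What follows is a minimal sketch of one such corpus-building step, not the script referred to above. It assumes the documents have already been fetched as plain strings; the function names, the naive sentence splitter, and the JSON Lines record layout with an empty `label` field are all illustrative choices.

```python
import json
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_corpus(documents):
    """Turn raw documents into deduplicated, still-unlabeled sentence records."""
    seen = set()
    records = []
    for doc_id, text in enumerate(documents):
        for sentence in split_sentences(text):
            normalized = " ".join(sentence.split())  # collapse stray whitespace
            key = normalized.lower()
            if key in seen:
                continue  # skip case-insensitive duplicates
            seen.add(key)
            records.append({"doc": doc_id, "text": normalized, "label": None})
    return records


def to_jsonl(records):
    """Serialize one JSON object per line, ready for manual tagging."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    docs = [
        "Great service.  Will come back!",
        "great service. Too expensive though.",
    ]
    print(to_jsonl(build_corpus(docs)))
```

The `label` field is left empty on purpose: filling it in is the tagging work, whether done by hand or by annotators.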
Creating a corpus is hard work (pre-processing, checking, tagging, etc.), but it has the benefit of preparing a model for a specific domain, which often increases accuracy. If you can get an already prepared corpus, just go ahead with the sentiment analysis ;)
I'm not aware of any such corpus being freely available, but you could try an unsupervised method on an unlabeled dataset.
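One simple method that needs no labeled data is lexicon-based scoring. The sketch below is only an illustration: the six-word lexicon and the negator list are made up for the example (a real run would load one of the published lexicons), and the negation handling only looks one word back.

```python
import re

# Tiny made-up lexicon for illustration; not taken from any published resource.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}
NEGATORS = {"not", "no", "never"}


def polarity(text):
    """Sum word scores, flipping the sign of a word directly after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score


def classify(text):
    """Map the summed score to a coarse label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["The support was great", "not good at all", "no idea"]:
        print(text, "->", classify(text))
```

A scorer like this is crude, but it gives a baseline against which any later supervised model can be compared.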