Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sentiment Analysis on LARGE collection of online conversation text

The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).

The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like/dislike sort of deal) from each user for any of the products they had discussed at some point.

So, what I would like to know:

1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?

2) How do I determine a specific users sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?

As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.

P.S. I have permission to analyze this data (in case it matters)

like image 444
araisbec Avatar asked Mar 10 '13 19:03

araisbec


People also ask

What is big data sentiment analysis?

Sentiment analysis [SA] is a computational analysis of sentiments or opinions, emotions, views, subjectivity expressed in text or associated with big data such as reviews, blogs, discussions, news, comments, feedback etc., about things such as electronic products, movies, public or private services, organizations, ...

What are the different types of sentiment analysis according to the text source?

Existing approaches to sentiment analysis can be grouped into three main categories: knowledge-based techniques, statistical methods, and hybrid approaches.

What is sentiment analysis of text?

Sentiment analysis (opinion mining) is a text mining technique that uses machine learning and natural language processing (nlp) to automatically analyze text for the sentiment of the writer (positive, negative, neutral, and beyond).


2 Answers

Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.

If you are interested in building a classifier for positive/negative sentiment, using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (eg. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).

All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.

I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.

There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.) but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class but its really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with this approach though, and would instead recommend LIBSVM.

For determining the topic, many have used simple keywords but there are some more complex methods available.

like image 80
Jared Avatar answered Oct 12 '22 11:10

Jared


You could train any classifier with similar datasets and see what the results are when you apply it to your data. For example, the NLTK contains the Movie Reviews Corpus that contains 1000 positive and 1000 negative reviews. Here is an example on how to train a Naive Bayes Classifier with it. Some other review datasets like Amazon Product Review data are available here.

Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.

like image 25
Suzana Avatar answered Oct 12 '22 11:10

Suzana