Mahout for sentiment analysis

Tags: machine-learning, sentiment-analysis, mahout

Using Mahout, I am able to classify the sentiment of my data, but I am stuck at the confusion matrix.

I am using the Mahout 0.7 Naive Bayes algorithm to classify the sentiment of tweets. I use the trainnb and testnb commands to train the classifier and to classify tweets as 'positive', 'negative' or 'neutral'.

Sample positive training set:

      'positive','i love my i phone'
      'positive','it's pleasure to have i phone'

Similarly, I have prepared negative and neutral training samples; it is a huge data set.

The sample test tweets I provide do not include sentiment labels:

      'it is nice model'
      'simply fantastic'

I am able to run the Mahout classification algorithm, and it outputs the classified instances as a confusion matrix.

As a next step, I need to find out which tweets show positive sentiment and which show negative sentiment. The expected output of the classification is each text tagged with its sentiment:

      'negative','very bad btr life time'
      'positive','i phone has excellent design features'

Which Mahout algorithm do I need to use to get output in the above format, or is a custom implementation required?

Kindly suggest which algorithms Apache Mahout provides that would be suitable for sentiment analysis of my Twitter data.

asked Mar 07 '13 11:03

Vanitha Reddy

2 Answers

In general, to classify some text with Naive Bayes you compute a score for the text under each class (positive and negative in your case) and then choose the class that gives the greater value.
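To make the "choose the greater value" idea concrete, here is a minimal sketch in plain Java. It does not use Mahout at all: the class name, the word counts, the equal priors and the vocabulary size are all made-up assumptions for illustration, and a real model would learn them from the training data.

```java
import java.util.HashMap;
import java.util.Map;

// Toy multinomial Naive Bayes scorer; all counts below are made-up illustration data.
class NaiveBayesSketch {

    // Log-score of the tokens under one class: log prior + sum of smoothed log likelihoods.
    static double score(String[] tokens, double prior, Map<String, Integer> wordCounts,
                        int totalWords, int vocabularySize) {
        double logScore = Math.log(prior);
        for (String token : tokens) {
            int count = wordCounts.getOrDefault(token, 0);
            // Laplace (add-one) smoothing so unseen words do not zero out the score.
            logScore += Math.log((count + 1.0) / (totalWords + vocabularySize));
        }
        return logScore;
    }

    // Scores the text against both classes and returns the label with the greater value.
    static String classify(String text) {
        Map<String, Integer> positive = new HashMap<>();
        positive.put("love", 3);
        positive.put("nice", 2);
        positive.put("phone", 2);
        Map<String, Integer> negative = new HashMap<>();
        negative.put("bad", 3);
        negative.put("hate", 2);
        negative.put("phone", 2);

        String[] tokens = text.toLowerCase().split("\\s+");
        int vocabularySize = 5; // love, nice, phone, bad, hate
        double positiveScore = score(tokens, 0.5, positive, 7, vocabularySize);
        double negativeScore = score(tokens, 0.5, negative, 7, vocabularySize);
        return positiveScore >= negativeScore ? "positive" : "negative";
    }

    public static void main(String[] args) {
        System.out.println(classify("it is nice phone"));
        System.out.println(classify("very bad phone"));
    }
}
```

With these toy counts the first call returns "positive" and the second "negative". Working in log space avoids numeric underflow when many small probabilities are multiplied.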

This excerpt from the Mahout book has some examples. See Listing 2:

Parameters p = new Parameters();
p.set("basePath", modelDir.getCanonicalPath());
Datastore ds = new InMemoryBayesDatastore(p);
Algorithm a = new BayesAlgorithm();
ClassifierContext ctx = new ClassifierContext(a,ds);
ctx.initialize();

....

ClassifierResult result = ctx.classifyDocument(tokens, defaultCategory);

Here, result should hold either the "positive" or the "negative" label.

answered Sep 21 '22 16:09

Ivan Koblik

I am not sure I can help you in full, but I hope I can give you some entry points. In general, my advice would be to download Mahout's source code and see how the examples and the classes you need are implemented. This is not that easy, and you should be prepared for the fact that Mahout doesn't have easy entry doors. But once you are through them, you will learn quickly.

First of all, it depends on the version of Mahout you are using. I am using 0.7 myself, so my explanation applies to 0.7.

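// Imports as packaged in Mahout 0.7
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;

public void classify(String modelLocation, RawEntry unclassifiedInstanceRaw) throws IOException {

    Configuration conf = new Configuration();

    // Load the model written by trainnb and wrap it in a classifier.
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelLocation), conf);
    AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

    // RawEntry is a custom class; this returns space-separated features.
    String unclassifiedInstanceFeatures = RawEntry.toNaiveBayesTrainingFormat(unclassifiedInstanceRaw);
    String[] features = unclassifiedInstanceFeatures.split(" ");

    FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder("features");
    vectorEncoder.setProbes(1); // my features vectors are tiny

    // Encode the features as a Mahout Vector.
    Vector unclassifiedInstanceVector = new RandomAccessSparseVector(features.length);

    for (String feature : features) {
        vectorEncoder.addToVector(feature, unclassifiedInstanceVector);
    }

    // The result holds one score per class.
    Vector classificationResult = classifier.classifyFull(unclassifiedInstanceVector);

    System.out.println(classificationResult.asFormatString());

}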

What happens here:

1) First, you load the model produced by trainnb. It was saved to the location you passed with the -o parameter when calling trainnb. The model is a .bin file.

2) A StandardNaiveBayesClassifier is created from your model.

3) RawEntry is my custom class, which is just a wrapper around the raw string of my data. toNaiveBayesTrainingFormat takes the string I want to classify, removes noise from it based on my needs, and simply returns a string of features such as 'word1 word2 word3 word4'. This converts my unclassified raw string into a format suitable for classification.

4) Now the string of features needs to be encoded as a Mahout Vector, because the classifier only accepts Vector input.

5) Pass the vector to the classifier - magic.
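Since RawEntry is a custom class that is not shown, here is a rough sketch of what the noise-removal part of step 3 could look like. The class name FeatureCleaner, the method name and the specific cleaning rules (dropping links and @mentions, stripping punctuation) are assumptions for illustration only; your own rules will depend on your data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the custom RawEntry cleaning step; not part of Mahout.
class FeatureCleaner {

    // Lowercases the text, drops URLs and @mentions, strips punctuation,
    // and returns the remaining words joined by single spaces.
    static String toFeatureString(String rawTweet) {
        List<String> features = new ArrayList<>();
        for (String token : rawTweet.toLowerCase().split("\\s+")) {
            if (token.startsWith("http") || token.startsWith("@")) {
                continue; // treat links and mentions as noise
            }
            String word = token.replaceAll("[^a-z0-9']", "");
            if (!word.isEmpty()) {
                features.add(word);
            }
        }
        return String.join(" ", features);
    }

    public static void main(String[] args) {
        System.out.println(toFeatureString("@apple I LOVE my i phone!!! http://example.com"));
    }
}
```

Whatever cleaning you choose, apply exactly the same steps to the training data and to the tweets you classify, otherwise the features will not match what the model has seen.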

This is the first part. Now, the classifier returns a Vector that contains a score for each class (each sentiment in your case), but you want a specific output format. The most straightforward approach to implement (though I assume not the most efficient or elegant) would be the following:

1) Create a map reduce job that goes through all the data you want to classify.

2) For each instance, call the classify method (don't forget to make a few changes so that a StandardNaiveBayesClassifier is not created for every instance).

3) Once you have the classification result vector, you can output the data in whatever format you wish from your map reduce job.

4) A useful setting here is jC.set("mapreduce.textoutputformat.separator", " "); where jC is the JobConf. It lets you choose the separator for the output file of the map reduce job. In your case this is ",". The exact property name differs between Hadoop versions, so check which one your version reads.
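The step from a result vector to a tagged line can be sketched in plain Java as below. This is only an illustration under assumptions: SentimentTagger is a hypothetical helper, the scores are made up, the label order is assumed (in practice it has to match the label index produced during training), and it assumes that the larger score means the more likely class, which you should verify against your own model.

```java
// Hypothetical helper that turns per-class scores into one output line; not Mahout API.
class SentimentTagger {

    // Returns the index of the largest score (assumed to be the most likely class).
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Formats one classified tweet as 'label','text'.
    static String tag(String[] labels, double[] scores, String tweet) {
        return "'" + labels[argMax(scores)] + "','" + tweet + "'";
    }

    public static void main(String[] args) {
        // Assumed label order; in practice it must match the label index used in training.
        String[] labels = {"negative", "neutral", "positive"};
        double[] scores = {-42.7, -40.1, -35.3}; // made-up scores for illustration
        System.out.println(tag(labels, scores, "i phone has excellent design features"));
    }
}
```

In the map reduce job, the mapper would build the classifier once during setup and then call something like tag for every input tweet.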

Again, all of this applies to Mahout 0.7. There is no guarantee it will work for you as is, but it worked for me.

In general, I have never worked with Mahout from the command line; for me, using Mahout from Java is the way to go.

answered Sep 21 '22 16:09

Jan Domozilov
