
Get same value for precision, recall and F score in Apache Spark Logistic regression algorithm

I have implemented logistic regression for a classification problem, and I get the same value for precision, recall and F1 score. Is it normal for all three to be identical? I ran into the same thing with decision trees and random forests: precision, recall and F1 score were identical there too.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;

// Run training algorithm to build the model.
final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(13)
        .run(data.rdd());

// Compute predictions on the test set as (prediction, label) pairs.
JavaRDD<Tuple2<Object, Object>> predictionAndLabels = testData.map(
        new Function<LabeledPoint, Tuple2<Object, Object>>() {
            public Tuple2<Object, Object> call(LabeledPoint p) {
                Double prediction = model.predict(p.features());
                return new Tuple2<Object, Object>(prediction, p.label());
            }
        }
);

// Get evaluation metrics.
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
double precision = metrics.precision();
System.out.println("Precision = " + precision);

double recall = metrics.recall();
System.out.println("Recall = " + recall);

double FScore = metrics.fMeasure();
System.out.println("F Measure = " + FScore);
Thamali Wijewardhana asked Mar 11 '23 11:03



2 Answers

I am facing the same problem. I have tried decision trees, random forests and GBT, and every time I get the same precision, recall and F1 score. The accuracy, which I calculated from the confusion matrix, is also the same value.

So I wrote my own code to compute accuracy, precision, recall and F1 score from the confusion matrix.

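from pyspark.ml.classification import RandomForestClassifier
from pyspark.mllib.evaluation import MulticlassMetrics

# Train on the training split and score the test split
rf = RandomForestClassifier(labelCol='label', featuresCol='features')
model = rf.fit(trainingData)
transformed = model.transform(testData)

# MulticlassMetrics expects an RDD of (prediction, label) pairs
results = transformed.select('prediction', 'label')
predictionAndLabels = results.rdd.map(
    lambda row: (float(row['prediction']), float(row['label'])))
metrics = MulticlassMetrics(predictionAndLabels)

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so for a binary problem these formulas score label 0 (use cm[1][1] for label 1)
cm = metrics.confusionMatrix().toArray()
accuracy = (cm[0][0] + cm[1][1]) / cm.sum()
precision = cm[0][0] / (cm[0][0] + cm[1][0])
recall = cm[0][0] / (cm[0][0] + cm[0][1])
f1 = 2 * precision * recall / (precision + recall)
print("RandomForestClassifier: accuracy, precision, recall, f1",
      accuracy, precision, recall, f1)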
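
The hand-written formulas above cover one class of a binary problem. A minimal sketch of the same idea for any number of classes follows, in plain Python so it runs without Spark. It assumes the confusion matrix is laid out with actual labels in rows and predicted labels in columns. The helper name `per_class_metrics` and the counts are made up for illustration.

```python
def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix.

    Assumes cm[i][j] counts samples whose actual class is i and predicted class is j.
    """
    n = len(cm)
    out = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: TP + FP
        actual_k = sum(cm[k])                          # row sum: TP + FN
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[k] = (precision, recall, f1)
    return out


# Made-up 3-class counts, for illustration only.
cm = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
]
for k, (p, r, f1) in per_class_metrics(cm).items():
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With a Spark confusion matrix, passing `metrics.confusionMatrix().toArray().tolist()` should give the nested-list shape this helper expects.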
Avinash answered Mar 13 '23 06:03



You can pass the label as an argument to the precision and recall methods. For binary classification, passing the label 1 worked for me. For multiclass classification, pass the label of the class whose precision and recall you want to calculate.

// Java has no named arguments, so the label is passed positionally as a double
double precision = metrics.precision(1.0);
System.out.println("Precision = " + precision);
double recall = metrics.recall(1.0);
System.out.println("Recall = " + recall);
// Pass the label here as well to get the F-measure of that class
double FScore = metrics.fMeasure(1.0);
System.out.println("F Measure = " + FScore);
user25260 answered Mar 13 '23 07:03
