
Evaluation Metrics for Binary Classification in Spark: AUC and PR curve


I am calculating precision and recall by threshold for LogisticRegressionWithLBFGS using BinaryClassificationMetrics, and that part works. Now I am trying to figure out whether I can get graphical output of the PR and ROC curves.

My code is below:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}



object log_reg_eval_metric {

  def main(args: Array[String]): Unit = {


    System.setProperty("hadoop.home.dir", "c:\\winutil\\")


    val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))

    val sqlContext = new org.apache.spark.sql.SQLContext(sc);

    val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")


    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    //Splitting the data
    val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
    val training: RDD[LabeledPoint] = splits(0).cache()
    val test: RDD[LabeledPoint] = splits(1)



    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)
    // Clear the prediction threshold so the model will return probabilities
    model.clearThreshold

    // Compute raw scores on the test set
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Instantiate metrics object
    val metrics = new BinaryClassificationMetrics(predictionAndLabels)

    // Precision by threshold
    val precision = metrics.precisionByThreshold
    precision.foreach { case (t, p) =>
      println(s"Threshold: $t, Precision: $p")
    }


    // Precision-Recall Curve
    val PRC = metrics.pr

    print(PRC)



  }
}

Output from print(PRC):

UnionRDD[39] at union at BinaryClassificationMetrics.scala:108

I am not sure what a UnionRDD is or how to use it. Is there another way to get graphical output? I am still researching this, so any suggestions would be welcome.

asked May 26 '16 13:05 by PARTHA TALUKDER



1 Answer

You can use BinaryLogisticRegressionTrainingSummary from the spark.ml package. It provides the PR and ROC values out of the box as DataFrames.
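As a rough sketch of what that could look like, the snippet below fits a spark.ml LogisticRegression on a tiny made-up DataFrame and reads the curves from the training summary. It assumes Spark 2.3 or later (for `SparkSession` and `binarySummary`); the object name `MlSummaryCurves`, the helper `fitSummary`, and the data are hypothetical and only there for illustration.

```scala
import org.apache.spark.ml.classification.{BinaryLogisticRegressionTrainingSummary, LogisticRegression}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MlSummaryCurves {

  // Fit a binary logistic regression on made-up data and return its training summary.
  def fitSummary(spark: SparkSession): BinaryLogisticRegressionTrainingSummary = {
    // Hypothetical (label, features) rows; replace with the real training DataFrame.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(2.0, 1.1)),
      (0.0, Vectors.dense(0.1, 1.3)),
      (1.0, Vectors.dense(1.5, 0.2)),
      (0.0, Vectors.dense(0.4, 0.9)),
      (1.0, Vectors.dense(0.3, 1.0)),
      (0.0, Vectors.dense(1.8, 1.2))
    )).toDF("label", "features")

    val model = new LogisticRegression()
      .setMaxIter(20)
      .setRegParam(0.1)
      .fit(training)

    // binarySummary is available for two-class models (assumed Spark 2.3+).
    model.binarySummary
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MlSummaryCurves")
      .master("local[*]")
      .getOrCreate()

    val summary = fitSummary(spark)

    // roc is a DataFrame with columns FPR and TPR.
    summary.roc.show()
    // pr is a DataFrame with columns recall and precision.
    summary.pr.show()
    println(s"Area under ROC: ${summary.areaUnderROC}")

    spark.stop()
  }
}
```

Note that these curves are computed on the training data; scoring a held-out set would need a separate evaluation step.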

You can feed these values to any plotting utility to see the curves. (Any multi-line plot of x and y values will display them.)
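If you would rather stay with the mllib code from the question, the same idea applies there: `metrics.pr` is an RDD of (recall, precision) pairs, so printing it only shows the RDD's name (the "UnionRDD[39]" line), and the points have to be collected to the driver first. The sketch below is one possible way to do that and dump the points as CSV for a plotting tool. The object name `CurveExport`, the helper `toCsv`, the output file names, and the six score/label pairs are all made up for illustration; in the question's program the pairs would be `predictionAndLabels`.

```scala
import java.io.PrintWriter

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.{SparkConf, SparkContext}

object CurveExport {

  // Render (x, y) points as CSV text with a header row (hypothetical helper).
  def toCsv(points: Seq[(Double, Double)], xName: String, yName: String): String =
    (s"$xName,$yName" +: points.map { case (x, y) => s"$x,$y" }).mkString("\n")

  // Write text to a local file, closing the writer even if the write fails.
  def writeFile(path: String, content: String): Unit = {
    val out = new PrintWriter(path)
    try out.write(content) finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CurveExport").setMaster("local[*]"))

    // Made-up (score, label) pairs standing in for predictionAndLabels.
    val scoreAndLabels = sc.parallelize(Seq(
      (0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.6, 1.0), (0.4, 0.0), (0.2, 0.0)))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)

    // pr() yields (recall, precision) pairs; roc() yields (FPR, TPR) pairs.
    // collect() brings the points to the driver as a local array.
    val prPoints = metrics.pr().collect().toSeq
    val rocPoints = metrics.roc().collect().toSeq

    println(s"Area under PR:  ${metrics.areaUnderPR()}")
    println(s"Area under ROC: ${metrics.areaUnderROC()}")

    // Hypothetical output file names; open them in any plotting tool.
    writeFile("pr_curve.csv", toCsv(prPoints, "recall", "precision"))
    writeFile("roc_curve.csv", toCsv(rocPoints, "fpr", "tpr"))

    sc.stop()
  }
}
```

Collecting is reasonable here because a curve has at most one point per distinct score. For very large test sets, the two-argument constructor `new BinaryClassificationMetrics(scoreAndLabels, numBins)` can down-sample the curve before it is collected.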

answered Sep 28 '22 04:09 by Prashant N
