Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to give predicted and label columns in BinaryClassificationMetrics evaluation for Naive Bayes model

I have a confusion regarding BinaryClassificationMetrics (Mllib) inputs. As per Apache Spark 1.6.0, we need to pass predictedandlabel of Type (RDD[(Double,Double)]) from transformed DataFrame that having predicted, probability(vector) & rawPrediction(vector).

I have created RDD[(Double,Double)] from Predicted and label columns. After performing BinaryClassificationMetrics evaluation on NavieBayesModel, I'm able to retrieve ROC, PR etc. But the values are limited, I can't able plot the curve using the value generated from this. Roc contains 4 values and PR contains 3 value.

Is it the right way of preparing PredictedandLabel or do I need to use rawPrediction column or Probability column instead of Predicted column?

like image 441
Desanth pv Avatar asked Aug 01 '16 09:08

Desanth pv


1 Answers

Prepare like this:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)

val preds = predictions.select("probability", "label").rdd.map(row => 
  (row.getAs[Vector](0)(0), row.getAs[Double](1)))

And evaluate:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

new BinaryClassificationMetrics(preds, 10).roc

If predictions are only 0 or 1 number of buckets can be lower like in your case. Try more complex data like this:

val anotherPreds = df1.select(rand(), $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc
like image 92
user6022341 Avatar answered Oct 05 '22 07:10

user6022341