How can we get model metrics when training a random forest binary classifier model in Spark Mllib (F score, AUROC, AUPRC etc.)?
The issue is that BinaryClassificationMetrics
takes probabilities while the predict method of a RandomForest classifier returns discrete values 0 or 1.
See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
A RandomForest.trainClassifier
does not have any clearThreshold
method which would make it return probabilities instead of discrete 0 or 1 labels.
We need to use the new ml
DataFrames based API to get the probabilities instead of the RDD based mllib
API.
Update
Following is updated example from Spark documentation to use a BinaryClassificationEvaluator
and display the metrics: Area Under Receiver Operating Characteristic
(AUROC) and Area Under Precision Recall Curve
(AUPRC).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions
.select("indexedLabel", "rawPrediction", "prediction")
.show()
val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
.setLabelCol("indexedLabel")
.setRawPredictionCol("rawPrediction")
def printlnMetric(metricName: String): Unit = {
println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}
printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With