I'm wondering what the best way is to evaluate a fitted binary classification model using Apache Spark 2.4.5 and PySpark (Python). I want to consider different metrics such as accuracy, precision, recall, auc and f1 score.
Let us assume that the following is given:
# pyspark.sql.dataframe.DataFrame in VectorAssembler format containing two columns: target and features
# DataFrame we want to evaluate
df
# Fitted pyspark.ml.tuning.TrainValidationSplitModel (any arbitrary ml algorithm)
model
1. Option
Neither BinaryClassificationEvaluator nor MulticlassClassificationEvaluator can calculate all metrics mentioned above on their own. Thus, we use both evaluators.
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
# Create both evaluators
evaluatorMulti = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction")
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="prediction", metricName='areaUnderROC')
# Make predicitons
predictionAndTarget = model.transform(df).select("target", "prediction")
# Get metrics
acc = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "accuracy"})
f1 = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "f1"})
weightedPrecision = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedPrecision"})
weightedRecall = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedRecall"})
auc = evaluator.evaluate(predictionAndTarget)
Downside
weightedPrecision
and weightedRecall
(which is ok for a multi class classification). However, are these two metrics equal to precision
and recall
in a binary case ?2. Option
Using RDD based API with BinaryClassificatinMetrics and MulticlassMetrics. Again, both metrics can't calculate all metrics mentioned above on their own (at least not in python ..). Thus, we use both.
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics
# Make prediction
predictionAndTarget = model.transform(df).select("target", "prediction")
# Create both evaluators
metrics_binary = BinaryClassificationMetrics(predictionAndTarget.rdd.map(tuple))
metrics_multi = MulticlassMetrics(predictionAndTarget.rdd.map(tuple))
acc = metrics_multi.accuracy
f1 = metrics_multi.fMeasure(1.0)
precision = metrics_multi.precision(1.0)
recall = metrics_multi.recall(1.0)
auc = metrics_binary.areaUnderROC
Downsides
Upside
Surprise
f1
and areaUnderRoc
values when using Option 1 vs when using Option 2.Option 3
Use numpy and sklearn
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, f1_score
# Make predicitons
predictionAndTarget = model.transform(df).select("target", "prediction")
predictionAndTargetNumpy = np.array((predictionAndTarget.collect()))
acc = accuracy_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
f1 = f1_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
precision = precision_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
recall = recall_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
auc = roc_auc_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
Downside
To summarize my questions:
Not sure if it is relevant now , but can answer your question 3 and thus may be question 1 inturn-
Spark ML provides Weighted Precision & Weighted Recall metrics only as part of MulticlassClassificationEvaluator module. If you're looking to have equivalent interpretation of Overall Precision metric, especially incase of Binary Classification equivalent to Scikit world , then better to compute Confusion Matrix and evaluate using the formula of Precision & Recall
Weighted precision ,used by Spark ML ,is computed using precision of both the classes and then adding using weight of each class label in test set i.e.
Prec (Label 1) = TP/(TP+FP)
Prec (Label 0) = TN/(TN+FN)
Weight of Label 1 in test set WL1 = L1/(L1+L2)
Weight of Label 0 in test set WL2 = L2/(L1+L2)
Weighted precision = (PrecL1 * WL1) + (PrecL0 * WL2)
Weighted Precision &Recall will be more than Overall Precision & Recall in case of even slight class imbalance in the dataset and thus metrics between Sklearn based & Spark ML based will differ.
As an illustration , a Confusion Matrix of class imbalance dataset as below :
array([[3969025, 445123],
[ 284283, 1663913]])
Total 1 Class labels 1948196
Total 0 Class labels 4414148
Proportion Label 1 :0.306207272
Proportion Label 0 :0.693792728
Spark ML will give metrics :
Accuracy : 0.8853557745384405
Weighted Precision : 0.8890015815237463
WeightedRecall : 0.8853557745384406
F-1 Score : 0.8865644697253956
whereas Actual Overall metrics computation gives (Scikit Equivalent):
Accuracy: 0.8853557745384405
Precision: 0.7889448070113549
Recall: 0.8540788503826103
AUC: 0.8540788503826103
f1: 0.8540788503826103
Thus Spark ML weighted version inflates the otherwise Overall metric computation that we observe especially for Binary Classification
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With