How to extract average metrics with Cross-Validation in PySpark

I'm trying to run cross-validation over a Random Forest in Spark 1.6.0, and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of each metric over all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?

I only found examples where the evaluation is performed afterwards on an independent test dataset, using the best model from the cross-validation. I'm not planning to use separate train and test sets. Instead, I want to use the whole DataFrame (df) for the cross-validation, let it make the splits, and then take the average metrics.

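from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# pipeline is the Pipeline with the random forest stage (definition not shown)
paramGrid = ParamGridBuilder().build()  # empty grid: a single parameter map
evaluator = MulticlassClassificationEvaluator()  # default metric is f1

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5)

model = crossval.fit(df)

# Score the best model on the same DataFrame it was fit on
evaluator.evaluate(model.transform(df))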

For now, I obtain the best model's metric with the last line of the code above, evaluator.evaluate(model.transform(df)), but I'm not sure that I'm doing it correctly.

Ed. asked Oct 24 '25 07:10


1 Answer

In Spark 2.x, you can get the averaged metrics from model.avgMetrics. It returns a list of doubles with one entry per parameter map in estimatorParamMaps, each entry being the evaluator's metric averaged over the folds. With the empty ParamGridBuilder().build() grid from the question, the list has a single element.

For MulticlassClassificationEvaluator, that metric is the one selected by metricName: f1 (the default), weightedPrecision, weightedRecall, or accuracy (as documented here). You can pick a different one with the evaluator's setter, setMetricName.

If you also need the best model parameters chosen by the cross validator, please see my answer here.

Algorithman answered Oct 25 '25 23:10