 

SPARK, ML, Tuning, CrossValidator: access the metrics

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
        .setEstimator(pipeline)
        .setEstimatorParamMaps(paramGrid)
        .setEvaluator(new MulticlassClassificationEvaluator)
        .setNumFolds(10)

val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and, finally, NaiveBayes.

Is it possible to access the metrics calculated for the best model?

Ideally, I would like to access the metrics of all the models to see how changing the parameters affects the quality of the classification. But for the moment, the best model is good enough.

FYI, I am using Spark 1.6.0

Asked by Rami on Jan 08 '16

People also ask

What is CrossValidator in PySpark?

CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
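
As a minimal sketch in the question's own Scala API (reusing the pipeline and paramGrid already shown above), a 3-fold CrossValidator would look like this:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// With k = 3 folds, each parameter combination is trained on 2/3 of the data
// and evaluated on the remaining 1/3, three times; the metric stored per
// combination is the average over those three folds.
val cv3 = new CrossValidator()
  .setEstimator(pipeline)            // the pipeline defined in the question
  .setEstimatorParamMaps(paramGrid)  // the parameter grid to search
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(3)

val cv3Model = cv3.fit(trainingSet)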

What is a Pipeline in PySpark?

A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
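
For example, a pipeline like the asker's could be assembled as follows (a sketch only; the column names "text", "words", etc. are assumptions, not taken from the question):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// Transformers and Estimators are chained in order; Pipeline.fit() runs them
// in sequence and returns a fitted PipelineModel.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf        = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val nb        = new NaiveBayes()   // expects "features" and "label" columns by default

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, tf, idf, nb))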

What is Spark ML?

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as: ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.
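
As one concrete illustration (a sketch, assuming a DataFrame named data that already has a Vector column called "features"), fitting one of MLlib's clustering algorithms looks like this:

import org.apache.spark.ml.clustering.KMeans

// Cluster the rows of `data` into 3 groups based on their feature vectors.
val kmeans = new KMeans()
  .setK(3)
  .setFeaturesCol("features")

val kmeansModel = kmeans.fit(data)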


1 Answer

Here's how I do it:

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, stopWordsFilter, tf, idf, word2Vec, featureVectorAssembler, categoryIndexerModel, classifier, categoryReverseIndexer))

...

val paramGrid = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10, 100))
  .addGrid(idf.minDocFreq, Array(1, 10))
  .addGrid(word2Vec.vectorSize, Array(200, 300))
  .addGrid(classifier.maxDepth, Array(3, 5))
  .build()

paramGrid.size // 16 entries

...

// Average metric (across the folds) for each ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)

...

val bestModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]

// Explain params for each stage
val bestHashingTFNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].explainParams
val bestIDFMinDocFrequency = bestModel.stages(3).asInstanceOf[IDFModel].explainParams
val bestWord2VecVectorSize = bestModel.stages(4).asInstanceOf[Word2VecModel].explainParams
val bestDecisionTreeDepth = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].explainParams
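
As a small follow-up (not part of the original answer), the zipped (ParamMap, metric) pairs can be sorted and printed to see which combination scored best; for the default F1 metric of MulticlassClassificationEvaluator, larger is better:

// Sort the (params, averaged metric) pairs from best to worst and print them.
combined
  .sortBy { case (_, metric) => -metric }
  .foreach { case (params, metric) =>
    println(f"metric = $metric%.4f  params = $params")
  }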
Answered by Chris Fregly on Dec 05 '22