How to obtain the best trained model from a CrossValidator

I built a pipeline including a DecisionTreeClassifier (dt) like this:

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

Then I used this pipeline as the estimator in a CrossValidator, in order to get a model with the best set of hyperparameters, like this:

val c_v = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator()
    .setLabelCol("indexedLabel")
    .setPredictionCol("prediction"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
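
(paramGrid is not defined in the question; for context, it would typically come from a ParamGridBuilder. A minimal sketch, where the tuned parameters and values are illustrative assumptions:)

import org.apache.spark.ml.tuning.ParamGridBuilder

// Hypothetical grid; the parameters actually tuned in the question are not shown.
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(3, 5, 10))
  .addGrid(dt.impurity, Array("gini", "entropy"))
  .build()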

Finally, I can train a model on the training set with this cross-validator:

val model = c_v.fit(train)

But here is the question: I want to view the best trained decision tree model via the .toDebugString method of DecisionTreeClassificationModel. However, model is a CrossValidatorModel. Yes, I can use model.bestModel, but it is typed as Model, so I cannot call .toDebugString on it. I also assume the bestModel is still a pipeline including labelIndexer, featureIndexer, dt, and labelConverter.

So does anyone know how I can obtain the decision tree model from the model fitted by the cross-validator, so that I can view the actual tree with toDebugString? Or is there any workaround to view the decision tree model?

Fan L. asked Apr 01 '16



2 Answers

Well, in cases like this one the answer is always the same: be specific about the types.

First, extract the pipeline model, since what you are training is a Pipeline:

import org.apache.spark.ml.PipelineModel

// bestModel is typed as Model[_]; pattern match to recover the concrete type.
val bestModel: Option[PipelineModel] = model.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}

Then you'll need to extract the model from the underlying stage, in your case a decision tree classification model:

import org.apache.spark.ml.classification.DecisionTreeClassificationModel

val treeModel: Option[DecisionTreeClassificationModel] = bestModel.flatMap {
  _.stages.collect {
    case t: DecisionTreeClassificationModel => t
  }.headOption
}

To actually print the tree, for example:

treeModel.foreach(t => println(t.toDebugString))
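
If you already know the pipeline's structure, a more direct (but less safe) variant is to cast and collect the stage; this is a sketch, not part of the original answer, and it throws at runtime if the types do not match:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

// Throws ClassCastException / NoSuchElementException if the best model
// is not a PipelineModel containing a DecisionTreeClassificationModel.
val tree: DecisionTreeClassificationModel = model.bestModel
  .asInstanceOf[PipelineModel]
  .stages
  .collectFirst { case t: DecisionTreeClassificationModel => t }
  .get

println(tree.toDebugString)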
zero323 answered Oct 24 '22

(DISCLAIMER: There is another aspect which, imho, deserves its own answer. I know it is a little off-topic given the question; however, it questions the question itself. If somebody downvotes because they disagree with the content, please also leave a comment.)

Should you extract the "best" tree at all? The answer is typically no.

Why are we doing CV? We are trying to evaluate our choices: the classifier used, the hyperparameters, and preprocessing such as feature selection. For the last one it is important that this happens on the training data only; e.g., do not normalise the features on all the data. So the real output of CV is an evaluation of the pipeline configuration, not any single fitted model. On a side note: feature selection should be evaluated in an internal CV.

What we are not doing: we are not generating a "pool of classifiers" from which we pick the best one. However, I've seen this surprisingly often. The problem is that you have an extremely high chance of a twinning effect: even in a perfectly i.i.d. dataset there are likely (near-)duplicate training examples. So there is a pretty good chance that the "best" CV classifier is just an indication of which fold happened to have the most favourable twinning.

Hence, what should you do? Once you have fixed your parameters, you should use the entire training data to build the final model. Ideally (though hardly anybody does this), you have also set aside an additional evaluation set, never touched during model selection, to get an unbiased evaluation of your final model.
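
A minimal sketch of that workflow in Spark ML, reusing the question's c_v. The name data, the 80/20 split, and the seed are illustrative assumptions; note that Spark's CrossValidator already refits the winning ParamMap on the full training data it is given.

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Hold out an evaluation set that model selection never sees.
// (The 80/20 split and seed are arbitrary illustrative choices.)
val Array(trainData, holdout) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// Cross-validate on the training portion only; the resulting
// CrossValidatorModel is refit with the best ParamMap on all of trainData.
val cvModel = c_v.fit(trainData)

// Evaluate exactly once, on the untouched holdout set.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
println(evaluator.evaluate(cvModel.transform(holdout)))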

CAFEBABE answered Oct 24 '22