I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
Running this in PySpark shell, I can get the linear regression model's coefficients, but I can't seem to find the value of lr.regParam selected by the cross validation procedure. Any ideas?
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
Ran into this problem as well. I found out you need to call the java property for some reason I don't know why. So just do this:
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
                                .addGrid(lr.regParam, [0]) \
                                .addGrid(lr.elasticNetParam, [1]) \
                                .build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, \
                        evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel
Printing out the parameters you want:
>>> print 'Best Param (regParam): ', bestModel._java_obj.getRegParam()
0
>>> print 'Best Param (MaxIter): ', bestModel._java_obj.getMaxIter()
500
>>> print 'Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam()
1
This applies to other methods like extractParamMap() as well. They should fix this soon.
This might not be as good as wernerchao answer (because it's not convenient to store hyperparameters in variables), but you can quickly look at the best hyper-parameters of a cross validation model this way :
cvModel.getEstimatorParamMaps()[ np.argmax(cvModel.avgMetrics) ]
Assuming cvModel3Day is your model names, params can be extracted as shown below in Spark Scala
val params = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].extractParamMap()
val depth = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxDepth
val iter = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxIter
val bins = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxBins
val features  = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getFeaturesCol
val step = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getStepSize
val samplingRate  = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getSubsamplingRate
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With