I've estimated a logistic regression using pipelines.
My last few lines before fitting the logistic regression:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
# Create an assembler that includes the encoded features:
lr_assembler = VectorAssembler(
    inputCols=numericColumns
    + [categoricalCol + "ClassVec" for categoricalCol in categoricalColumns],
    outputCol="lr_features",
)
# Model definition:
lr = LogisticRegression(featuresCol="lr_features", labelCol="targetvar")
# Pipeline definition:
lr_pipeline = Pipeline(stages=indexStages + encodeStages + [lr_assembler, lr])
# Fit the logistic regression model:
lrModel = lr_pipeline.fit(train_train)
And then I tried to run the summary of the model. However, the code line below:
trainingSummary = lrModel.summary
results in: 'PipelineModel' object has no attribute 'summary'
Any advice on how to extract, from a pipeline model, the summary information that a fitted regression model usually provides?
Thanks a lot!
Build the logistic regression model. To train and test the model, the data set needs to be split into a training set and a test set: 70% of the data is used to train the model, and 30% is used for testing. The same model can be built with a Spark Pipeline.
A Pipeline is an Estimator. Thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage.
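The PipelineModel itself has no summary attribute, but the fitted logistic regression model stored among its stages does. Just get the model from stages:
lrModel.stages[-1].summary
If the model is not the last stage of the Pipeline, replace -1 with its index.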
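If you would rather not hard-code the index, one option is to look the stage up by its type. The following is only a minimal sketch: `find_stage` is a hypothetical helper name, not part of PySpark, and it assumes nothing beyond the fitted pipeline exposing a `stages` list, which `PipelineModel` does.

```python
def find_stage(pipeline_model, stage_type):
    """Return the first fitted stage that is an instance of stage_type.

    find_stage is a hypothetical helper (not a PySpark API). It relies only
    on pipeline_model having a `stages` list, as PipelineModel does.
    """
    for stage in pipeline_model.stages:
        if isinstance(stage, stage_type):
            return stage
    # No stage of the requested type was found in the fitted pipeline.
    raise ValueError(f"no stage of type {stage_type.__name__} in the pipeline")
```

With the pipeline from the question, this would presumably be used as `find_stage(lrModel, LogisticRegressionModel).summary`, with `LogisticRegressionModel` imported from `pyspark.ml.classification`. For a binary problem the returned summary exposes attributes such as `areaUnderROC` and `objectiveHistory`.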