Save spark model summary

Tags: python, apache-spark, logistic-regression, pyspark

I am running a logistic regression in PySpark on Spark 2.1.2.

I know it is possible to save a regression model as follows:

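from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# Define the estimator; the original pipeline definition is not shown,
# so a single-stage pipeline is assumed here
lr = LogisticRegression(featuresCol='features',
                        labelCol='is_clickout',
                        regParam=0,
                        fitIntercept=False,
                        family="binomial")
pipeline = Pipeline(stages=[lr])

# `data` is the training DataFrame
model = pipeline.fit(data)

# save model for future use
save_path = "model_0"
model.save(save_path)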

The problem is that the saved model does not save the summary:

from pyspark.ml import PipelineModel

# The saved object is a fitted pipeline, so load it as a PipelineModel
model2 = PipelineModel.load(save_path)
model2.stages[-1].hasSummary  # returns False

I can extract the summary as follows, but it has no save method attached to it:

# Get the model summary
summary = model.stages[-1].summary

Is there a quick way to save the summary object, ideally one that also works for multiple regressions?

Currently, I read all the object attributes and save them as a Pandas dataframe df.

asked Dec 11 '18 10:12

hamiq

1 Answer

Unfortunately, your observation is correct. I had the same problem with Spark 2.4.3, and I found this comment in the Spark source confirming the issue:

"For LinearRegressionModel, this does NOT currently save the training summary. An option to save summary may be added in the future."

This same comment is still there for Spark 3.0.0-rc1 (the last available tag in its repository).

If we want to persist the summary, we need to serialize it somehow ourselves. I've done this before by extracting the statistics I wanted and saving them in a JSON document just after training my model.
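A minimal sketch of that approach, assuming the statistics of interest are the `areaUnderROC`, `totalIterations` and `objectiveHistory` attributes of the binary logistic regression training summary. The helper names (`summary_to_dict`, `save_summary`, `load_summary`) and the output file name are hypothetical, and the JSON file is written to the driver's local filesystem. DataFrame-valued attributes such as `summary.roc` are not covered here and would need to be collected separately, for example with `summary.roc.toPandas()`.

```python
import json

# Attributes copied from the summary object (an assumed selection; adjust as needed).
SCALAR_FIELDS = ("areaUnderROC", "totalIterations")
LIST_FIELDS = ("objectiveHistory",)


def summary_to_dict(summary):
    """Copy the selected statistics into a plain, JSON-serializable dict.

    Attributes missing from the summary object are stored as None.
    """
    stats = {name: getattr(summary, name, None) for name in SCALAR_FIELDS}
    for name in LIST_FIELDS:
        values = getattr(summary, name, None)
        stats[name] = None if values is None else [float(v) for v in values]
    return stats


def save_summary(summary, path):
    """Write the selected statistics to a local JSON file."""
    with open(path, "w") as fh:
        json.dump(summary_to_dict(summary), fh, indent=2)


def load_summary(path):
    """Read the statistics back as a plain dict (not a Spark summary object)."""
    with open(path) as fh:
        return json.load(fh)


# Hypothetical usage right after training, with `model` being the fitted
# pipeline from the question:
#   save_summary(model.stages[-1].summary, "model_0_summary.json")
```

For multiple regressions, the same helper can be called once per fitted model with a different file name. Note that the loaded result is only a dictionary of numbers; it does not restore `hasSummary` on the reloaded model.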

answered Sep 30 '22 21:09

boechat107
