How to overwrite Spark ML model in PySpark?

from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)

These lines worked the first time I saved the model. But when I tried to save the model to the same path again, I got this error:

Py4JJavaError: An error occurred while calling o1695.save. : java.io.IOException: Path ./hdfsData/rfr_model already exists. Please use write.overwrite().save(path) to overwrite it.

Then I tried:

rf_model.write.overwrite().save(rf_model_path)

It gave:

AttributeError: 'function' object has no attribute 'overwrite'

It seems the pyspark.mllib module provides an overwrite function, but the pyspark.ml module does not. Does anyone know how to overwrite the old model with the new one? Thanks.

Asked Feb 17 '17 by Veronica Wenqian Cheng

1 Answer

The message you see is a Java-side error, not a Python one. In pyspark.ml, write is a method, not a property, so you need to call it first:

rf_model.write().overwrite().save(rf_model_path)
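This works because calling write() returns an MLWriter object, whose overwrite() enables overwriting and returns the writer so the calls can be chained into save(). The AttributeError in the question arises from accessing write without calling it, which yields the method object itself. The mechanics can be sketched without Spark; the toy classes below are hypothetical stand-ins, not the real PySpark API:

```python
# Toy sketch of why `rf_model.write.overwrite()` fails: `write` is a method,
# so accessing it without calling it returns the bound method object, which
# has no `overwrite` attribute.

class ToyWriter:
    """Mimics the chaining style of pyspark.ml.util.MLWriter (hypothetical)."""
    def __init__(self):
        self._overwrite = False

    def overwrite(self):
        # Enable overwriting and return self so calls can be chained.
        self._overwrite = True
        return self

    def save(self, path):
        return f"saved to {path} (overwrite={self._overwrite})"

class ToyModel:
    def write(self):  # a method, not a property
        return ToyWriter()

model = ToyModel()

# Wrong: `model.write` is the method object itself, so this raises AttributeError.
try:
    model.write.overwrite()
except AttributeError as err:
    print(err)

# Right: call write() first to get the writer, then chain overwrite() and save().
print(model.write().overwrite().save("./hdfsData/rfr_model"))
```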
Answered Nov 10 '22 by zero323