from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)
The first time I saved the model, these lines worked. But when I tried to save the model to the same path again, I got this error:
Py4JJavaError: An error occurred while calling o1695.save. : java.io.IOException: Path ./hdfsData/rfr_model already exists. Please use write.overwrite().save(path) to overwrite it.
Then I tried:
rf_model.write.overwrite().save(rf_model_path)
It gave:
AttributeError: 'function' object has no attribute 'overwrite'
It seems the pyspark.mllib module provides the overwrite function but the pyspark.ml module does not. Does anyone know how to resolve this if I want to overwrite the old model with the new one? Thanks.
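The message you see is a Java error message, not a Python one. Its hint uses Scala syntax, where write takes no parentheses. In PySpark, write is a method, so you have to call it (with parentheses) before chaining overwrite():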
rf_model.write().overwrite().save(rf_model_path)
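To see why the parentheses matter, here is a minimal sketch in plain Python that does not need Spark. StubWriter and StubModel are hypothetical stand-ins I made up for illustration, not the real PySpark classes; they only mimic the shape of the call chain (a write() method that returns a writer with overwrite() and save()).

```python
class StubWriter:
    """Hypothetical stand-in for an ML writer; not the real PySpark class."""

    def __init__(self):
        self.should_overwrite = False
        self.saved_to = None

    def overwrite(self):
        # Record the overwrite request and return self so calls can be chained.
        self.should_overwrite = True
        return self

    def save(self, path):
        # The stub only remembers the path; nothing is written to disk.
        self.saved_to = path


class StubModel:
    """Hypothetical stand-in for a fitted model that exposes write() as a method."""

    def write(self):
        return StubWriter()


model = StubModel()

# Without parentheses, model.write is the method itself, not a writer,
# so looking up .overwrite on it raises AttributeError.
try:
    model.write.overwrite()
    failed = False
except AttributeError:
    failed = True

# With parentheses, write() returns the writer and the chain works.
writer = model.write()
writer.overwrite().save("./hdfsData/rfr_model")
```

The same reasoning applies to rf_model: rf_model.write is just a reference to the method, while rf_model.write() gives you the writer object that actually has overwrite().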