Hello, I'm trying to load a saved pipeline with PipelineModel in PySpark.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

selectedDf = reviews \
    .select("reviewerID", "asin", "overall")

# Build a pipeline that indexes the reviewer and product IDs
reviewerIndexer = StringIndexer(
    inputCol="reviewerID",
    outputCol="intReviewer"
)
productIndexer = StringIndexer(
    inputCol="asin",
    outputCol="intProduct"
)
pipeline = Pipeline(stages=[reviewerIndexer, productIndexer])
pipelineModel = pipeline.fit(selectedDf)
transformedFeatures = pipelineModel.transform(selectedDf)

pipeline_model_name = './' + model_name + 'pipeline'
pipelineModel.save(pipeline_model_name)
This code successfully saves the model to the filesystem, but the problem is that I can't load the pipeline to use it on other data. When I try to load the model with the following code, I get this error:
pipelineModel = PipelineModel.load(pipeline_model_name)
Traceback (most recent call last):
File "/app/spark/load_recommendation_model.py", line 12, in <module>
sa.load_model(pipeline_model_name, recommendation_model_name, user_id)
File "/app/spark/sparkapp.py", line 142, in load_model
pipelineModel = PipelineModel.load(pipeline_model_name)
File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 311, in load
File "/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 240, in load
File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 497, in loadMetadata
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1379, in first
ValueError: RDD is empty
What is the problem? How can I solve this?
An ML pipeline (or ML workflow) is a sequence of Transformers and Estimators that fits a PipelineModel to an input dataset: pipeline: DataFrame =[fit]=> DataFrame (using transformers and estimators).
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame.
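For example, here is a minimal runnable sketch (assuming an existing SparkSession named spark; the toy DataFrame below is made up for illustration) showing an Estimator stage being fit into a PipelineModel that then transforms the data:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Toy stand-in for the reviews DataFrame (illustrative values only)
df = spark.createDataFrame(
    [("u1", "p1", 5.0), ("u2", "p1", 3.0), ("u1", "p2", 4.0)],
    ["reviewerID", "asin", "overall"]
)

# StringIndexer is an Estimator: fit() learns its index mapping, and
# the fitted Pipeline comes back as a PipelineModel of Transformers
indexer = StringIndexer(inputCol="reviewerID", outputCol="intReviewer")
model = Pipeline(stages=[indexer]).fit(df)

# transform() applies each fitted stage in order to the DataFrame
model.transform(df).show()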
I had the same issue. The problem was that I was running Spark on a cluster of nodes, but I wasn't using a shared file system to save my models. Saving the trained model therefore wrote the model's data onto whichever Spark workers held it in memory. When I wanted to load the model, I used the same path I had used when saving. In this situation the Spark master looks for the model at the specified path on ITS OWN local filesystem, where the data is incomplete, so it reports that the RDD (the data) is empty. (If you look at the directory of the saved model, you will see only _SUCCESS files, but loading a model also requires the part-00000 files.)
Using shared file systems like HDFS will fix the problem.
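For instance, a minimal sketch of the save/load round trip over a shared path (the HDFS URI below is a hypothetical placeholder; substitute any location visible to both the driver and all workers):

from pyspark.ml import PipelineModel

# Write the fitted pipeline to a shared file system so every node
# sees the same, complete model directory (hypothetical HDFS URI)
shared_path = "hdfs://namenode:8020/models/review_pipeline"
pipelineModel.save(shared_path)

# Loading from the same shared path now finds the full metadata
loadedModel = PipelineModel.load(shared_path)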