Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cannot load pipeline model from pyspark

Hello I try to load saved pipeline with Pipeline Model in pyspark.

    selectedDf = reviews\
        .select("reviewerID", "asin", "overall")

    # Make pipeline to build recommendation
    reviewerIndexer = StringIndexer(
        inputCol="reviewerID",
        outputCol="intReviewer"
        )
    productIndexer = StringIndexer(
        inputCol="asin",
        outputCol="intProduct"
        )
    pipeline = Pipeline(stages=[reviewerIndexer, productIndexer])
    pipelineModel = pipeline.fit(selectedDf)
    transformedFeatures = pipelineModel.transform(selectedDf)
    pipeline_model_name = './' + model_name + 'pipeline'
    pipelineModel.save(pipeline_model_name)

This code successfully save model in filesystem but the problem is that I can't load this pipeline to utilize it on other data. When I try to load model with following code I have this kind of error.

        pipelineModel = PipelineModel.load(pipeline_model_name)

Traceback (most recent call last):
  File "/app/spark/load_recommendation_model.py", line 12, in <module>
    sa.load_model(pipeline_model_name, recommendation_model_name, user_id)
  File "/app/spark/sparkapp.py", line 142, in load_model
    pipelineModel = PipelineModel.load(pipeline_model_name)
  File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 311, in load
  File "/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 240, in load
  File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 497, in loadMetadata
  File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1379, in first
ValueError: RDD is empty

What is the problem? How can I solve this?

like image 503
강희명 Avatar asked Jul 10 '18 05:07

강희명


People also ask

What is PipelineModel?

A ML pipeline (or a ML workflow) is a sequence of Transformers and Estimators to fit a PipelineModel to an input dataset. pipeline: DataFrame =[fit]=> DataFrame (using transformers and estimators)

How does Pipeline work in Apache spark?

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator . These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame .


1 Answers

I had the same issue. The problem was that I was running Spark on a cluster of nodes, but I wasn't using a shared file system to save my models. Thus, saving the trained model leaded to saving the model's data on the Spark workers which had the data in their memory. When I wanted to load the data, I used the same path which I used in the saving process. In this situation, Spark master goes and looks for the model in the specified path in ITS LOCAL, but the data is not complete there. Therefore, it asserts that the RDD (the data) is empty (if you take a look at the directory of the saved model you will see that there are only SUCCESS files, but for loading models, two other part-0000 files are necessary).

Using shared file systems like HDFS will fix the problem.

like image 165
Hossein Keshavarz Avatar answered Oct 23 '22 05:10

Hossein Keshavarz