Hello, I'm trying to load a saved pipeline with PipelineModel in PySpark.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

selectedDf = reviews \
    .select("reviewerID", "asin", "overall")

# Build a pipeline that indexes the reviewer and product IDs
reviewerIndexer = StringIndexer(
    inputCol="reviewerID",
    outputCol="intReviewer"
)
productIndexer = StringIndexer(
    inputCol="asin",
    outputCol="intProduct"
)
pipeline = Pipeline(stages=[reviewerIndexer, productIndexer])
pipelineModel = pipeline.fit(selectedDf)
transformedFeatures = pipelineModel.transform(selectedDf)

pipeline_model_name = './' + model_name + 'pipeline'
pipelineModel.save(pipeline_model_name)
This code successfully saves the model to the filesystem, but the problem is that I can't load the pipeline to use it on other data. When I try to load the model with the following code, I get this error:
pipelineModel = PipelineModel.load(pipeline_model_name)
Traceback (most recent call last):
File "/app/spark/load_recommendation_model.py", line 12, in <module>
sa.load_model(pipeline_model_name, recommendation_model_name, user_id)
File "/app/spark/sparkapp.py", line 142, in load_model
pipelineModel = PipelineModel.load(pipeline_model_name)
File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 311, in load
File "/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 240, in load
File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 497, in loadMetadata
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1379, in first
ValueError: RDD is empty
What is the problem? How can I solve this?
An ML pipeline (or ML workflow) is a sequence of Transformers and Estimators that fits a PipelineModel to an input dataset: pipeline: DataFrame =[fit]=> DataFrame (using transformers and estimators).
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame.
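For example, here is a minimal runnable sketch (assuming an existing SparkSession named spark; the toy DataFrame below is made up for illustration) showing an Estimator stage being fit into a PipelineModel that then transforms the data:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Toy stand-in for the reviews DataFrame (illustrative values only)
df = spark.createDataFrame(
    [("u1", "p1", 5.0), ("u2", "p1", 3.0), ("u1", "p2", 4.0)],
    ["reviewerID", "asin", "overall"]
)

# StringIndexer is an Estimator: fit() learns its index mapping, and
# the fitted Pipeline comes back as a PipelineModel of Transformers
indexer = StringIndexer(inputCol="reviewerID", outputCol="intReviewer")
model = Pipeline(stages=[indexer]).fit(df)

# transform() applies each fitted stage in order to the DataFrame
model.transform(df).show()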
I had the same issue. The problem was that I was running Spark on a cluster of nodes, but I wasn't using a shared file system to save my models. Saving the trained model therefore wrote the model's data onto whichever Spark workers held it in memory. When I wanted to load the model, I used the same path I had used when saving. In this situation the Spark master looks for the model at the specified path on ITS OWN local filesystem, where the data is incomplete, so it reports that the RDD (the data) is empty. (If you look at the directory of the saved model, you will see only _SUCCESS files, but loading a model also requires the part-00000 files.)
Using shared file systems like HDFS will fix the problem.
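For instance, a minimal sketch of the save/load round trip over a shared path (the HDFS URI below is a hypothetical placeholder; substitute any location visible to both the driver and all workers):

from pyspark.ml import PipelineModel

# Write the fitted pipeline to a shared file system so every node
# sees the same, complete model directory (hypothetical HDFS URI)
shared_path = "hdfs://namenode:8020/models/review_pipeline"
pipelineModel.save(shared_path)

# Loading from the same shared path now finds the full metadata
loadedModel = PipelineModel.load(shared_path)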