What is the right way to save\load models in Spark\PySpark

Tags:

I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation )

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
predictions.collect() # shows me some predictions
model.save(sc, "model0")

# Trying to load saved model and work with it
model0 = MatrixFactorizationModel.load(sc, "model0")
predictions0 = model0.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))

After I try to use model0 I get a long traceback, which ends with this:

Py4JError: An error occurred while calling o70.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

So my question is - am I doing something wrong? As far as I debugged my models are stored (locally and on HDFS) and they contain many files with some data. I have a feeling that models are saved correctly but probably they aren't loaded correctly. I also googled around but found nothing related.

Looks like this save\load feature has been added recently in Spark 1.3.0 and because of this I have another question - what was the recommended way to save\load models before the release 1.3.0? I haven't found any nice ways to do this, at least for Python. I also tried Pickle, but faced with the same issues as described here Save Apache Spark mllib model in python

381

asked Mar 25 '15 12:03

artemdevel

2 Answers

One way to save a model (in Scala; but probably is similar in Python):

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("linReg.model")

Saved model can then be loaded as:

val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()

Neil

As of this pull request merged on Mar 28, 2015 (a day after your question was last edited) this issue has been resolved.

You just need to clone/fetch the latest version from GitHub (git clone git://github.com/apache/spark.git -b branch-1.3) then build it (following the instructions in spark/README.md) with $ mvn -DskipTests clean package.

Note: I ran into trouble building Spark because Maven was being wonky. I resolved that issue by using $ update-alternatives --config mvn and selecting the 'path' that had Priority: 150, whatever that means. Explanation here.

answered Oct 28 '22 15:10

emmagras

Related questions
                            
                                Using MongoDB as our master database, should I use a separate graph database to implement relationships between entities?
                            
                                Eggs in path before PYTHONPATH environment variable
                            
                                Can the pudb debugger be used on windows?
                            
                                Passing variables between Python and Javascript
                            
                                Get a list of python packages used by a Django Project
                            
                                Writing functions that accept both 1-D and 2-D numpy arrays?
                            
                                Catching exceptions in django templates
                            
                                Stackless in PyPy and PyPy + greenlet - differences
                            
                                git cannot execute python-script as hook
                            
                                How to stop a python socket.accept() call?
                            
                                Conditional shebang line for different versions of Python
                            
                                OpenCV - imread(), imwrite() increases the size of png?
                            
                                Using methods defined in __init__.py within the module
                            
                                Combining websockets and WSGI in a python app
                            
                                Load Excel file into numpy 2D array
                            
                                How to transmit Android real-time sensor data to computer?
                            
                                python imaging library: Can I simply fill my image with one color?
                            
                                Why Python need rich comparison?
                            
                                sample weights in scikit-learn broken in cross validation
                            
                                sklearn.ensemble.AdaBoostClassifier cannot accecpt SVM as base_estimator?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What is the right way to save\load models in Spark\PySpark

Tags:

python

apache-spark

pyspark

apache-spark-mllib

artemdevel

People also ask

2 Answers

Neil

emmagras

Recent Activity

Donate For Us