Save Apache Spark mllib model in python [duplicate]

I am trying to save a fitted model to a file in Spark. I have a Spark cluster that trains a RandomForest model, and I would like to save the fitted model and reuse it on another machine. I have read some posts on the web that recommend using Java serialization. I am doing the equivalent in Python, but it does not work. What is the trick?

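import pickle
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=nb_tree, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=depth)
# Attempt to serialize the fitted model with pickle (this is the call that fails)
with open('model.ml', 'wb') as output:
    pickle.dump(model, output)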

I am getting this error:

TypeError: can't pickle lock objects

I am using Apache Spark 1.2.0.

asked Feb 10 '15 09:02 by poiuytrez


1 Answer

If you look at the source code, you'll see that RandomForestModel inherits from TreeEnsembleModel, which in turn inherits from the JavaSaveable class. JavaSaveable implements the save() method, so you can save your model as in the example below:

model.save([spark_context], [file_path])
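As a rough sketch of the full round trip, the two helpers below train and save on one machine and reload on another. The helper names, the default parameter values, and the assumption that model_path points to storage both machines can reach are illustrative and not from the question. The sketch also assumes a Spark release whose pyspark.mllib.tree exposes save() and RandomForestModel.load(); the asker's 1.2.0 may predate them.

```python
def train_and_save(sc, training_data, model_path, num_trees=10, max_depth=4):
    """Train a RandomForest regressor and persist it with model.save().

    sc is an existing SparkContext and training_data an RDD of LabeledPoint.
    num_trees and max_depth are placeholder defaults, not tuned values.
    """
    # Imported inside the function so the file can be loaded without PySpark.
    from pyspark.mllib.tree import RandomForest

    model = RandomForest.trainRegressor(
        training_data,
        categoricalFeaturesInfo={},
        numTrees=num_trees,
        featureSubsetStrategy="auto",
        impurity="variance",
        maxDepth=max_depth,
    )
    # Spark writes the model under model_path (assumed to be shared storage).
    model.save(sc, model_path)
    return model


def load_model(sc, model_path):
    """Reload a previously saved RandomForest model on another machine."""
    from pyspark.mllib.tree import RandomForestModel

    return RandomForestModel.load(sc, model_path)
```

On the second machine, something like load_model(sc, model_path).predict(features) would then reuse the fitted model, given a SparkContext connected to a compatible Spark installation.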

The save() method writes the model to file_path using the given spark_context. You cannot (at least for now) use Python's native pickle to do that. If you really want to, you'll need to implement the __getstate__ and __setstate__ methods manually. See the pickle documentation for more information.
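To illustrate what implementing those two methods involves, here is a minimal sketch using a toy class that holds a lock, the same kind of object named in the question's error message. LockedCounter is a hypothetical class invented for this example, not part of Spark. An MLlib model keeps its state in a JVM object, so this pattern alone would probably not be enough to pickle one.

```python
import pickle
import threading


class LockedCounter:
    """Toy object holding a lock, which pickle cannot serialize by default."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


counter = LockedCounter(41)
counter.increment()

# Round-trip through pickle; the lock is rebuilt on load.
restored = pickle.loads(pickle.dumps(counter))
restored.increment()
print(restored.value)  # 43
```

Without the two methods, pickle.dumps(counter) would raise a TypeError about the lock, much like the error in the question.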

answered Oct 20 '22 13:10 by Tarantula