Save Apache Spark mllib model in python [duplicate]

Tags: python, pyspark, apache-spark-mllib

I am trying to save a fitted model to a file in Spark. I have a Spark cluster that trains a RandomForest model, and I would like to save the fitted model and reuse it on another machine. I read some posts on the web that recommend using Java serialization. I am doing the equivalent in Python, but it does not work. What is the trick?

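import pickle
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=nb_tree, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=depth)
# Attempt to serialize the fitted model with pickle (this raises the error below)
with open('model.ml', 'wb') as output:
    pickle.dump(model, output)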

I am getting this error:

TypeError: can't pickle lock objects

I am using Apache Spark 1.2.0.


asked Feb 10 '15 09:02

poiuytrez

1 Answer

If you look at the source code, you'll see that RandomForestModel inherits from TreeEnsembleModel, which in turn inherits from the JavaSaveable class. JavaSaveable implements the save() method, so you can save your model as in the example below:

model.save([spark_context], [file_path])

This saves the model to file_path using the spark_context. You cannot (at least for now) use Python's native pickle for this. If you really want to, you'll need to implement the __getstate__ and __setstate__ methods manually. See the pickle documentation for more information.
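To illustrate the __getstate__ / __setstate__ route, here is a minimal sketch in plain Python. ModelWrapper and its weights attribute are hypothetical stand-ins, not part of PySpark; the class simply holds a threading.Lock, which is the kind of attribute that triggers the "can't pickle lock objects" error. The sketch drops the lock when pickling and recreates it when unpickling. It assumes the rest of the object's state is picklable, which is probably not true of a real MLlib model, since that wraps an object living in the JVM.

```python
import pickle
import threading


class ModelWrapper(object):
    """Hypothetical stand-in for an object that holds an unpicklable lock."""

    def __init__(self, weights):
        self.weights = weights
        self._lock = threading.Lock()  # pickle refuses to serialize this

    def __getstate__(self):
        # Copy the instance dict and drop the lock before pickling.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


original = ModelWrapper([0.1, 0.2, 0.3])
restored = pickle.loads(pickle.dumps(original))
print(restored.weights)
```

For an actual Spark model, the built-in save() shown above is the simpler path. As far as I know, the matching call for reading the model back on the other machine is RandomForestModel.load(sc, path), and both were only added in Spark releases after 1.2.0, so an upgrade may be needed.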


answered Oct 20 '22 13:10

Tarantula
