How to serialize a pyspark Pipeline object?

I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried the native Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself fails.

Here is the code snippet using the native pickle library:

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
with open('myfile', 'wb') as f:
    pickle.dump(pipeline, f, 2)
with open('myfile', 'rb') as f:
    pipeline1 = pickle.load(f)

Running it produces the error below:

py4j.protocol.Py4JError: An error occurred while calling o32.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)

Is it possible to serialize PySpark Pipeline objects?

asked Apr 15 '16 by Dinoop Thomas

1 Answer

Technically speaking, you can easily pickle a Pipeline object:

from pyspark.ml.pipeline import Pipeline
import pickle

pickle.dumps(Pipeline(stages=[]))
## b'\x80\x03cpyspark.ml.pipeline\nPipeline\nq ...

What you cannot pickle are Spark Transformers and Estimators, which are only thin wrappers around JVM objects. If you really need this, you can wrap the pipeline construction in a function, for example:

from pyspark.ml.feature import Tokenizer

def make_pipeline():
    return Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

pickle.dumps(make_pipeline)
## b'\x80\x03c__ ...

but since it is just a piece of code that doesn't store any persistent data, it doesn't look particularly useful.
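To see why pickling the factory function works but carries no state, here is a plain-Python sketch (no Spark required; the function name is illustrative): pickle stores module-level functions by reference, i.e. it records only the qualified name and looks the function up again on load.

```python
import pickle

# A module-level factory function, analogous to make_pipeline above.
def make_thing():
    return {"stages": []}

# The pickle payload records only the module path and function name,
# not the function's code or any object state.
payload = pickle.dumps(make_thing)

# Unpickling resolves the reference back to the same function,
# which must therefore be importable on the loading side too.
restored = pickle.loads(payload)
print(restored())  # {'stages': []}
```

This is also why the approach doesn't buy much: the loading process must already have the same code available, at which point it could simply call the factory directly.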

answered Oct 02 '22 by zero323