I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried both Python's native pickle library and PySpark's PickleSerializer; in both cases the dumps() call itself fails.
Here is the code snippet using the native pickle library:
import pickle
from pyspark.ml import Pipeline

# tokenizer, hashingTF and lr are pipeline stages defined earlier
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
with open('myfile', 'wb') as f:
    pickle.dump(pipeline, f, 2)
with open('myfile', 'rb') as f:
    pipeline1 = pickle.load(f)
Running this produces the error below:
py4j.protocol.Py4JError: An error occurred while calling o32.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)
Is it possible to serialize PySpark Pipeline objects?
Technically speaking, you can easily pickle a Pipeline object itself:
from pyspark.ml.pipeline import Pipeline
import pickle
pickle.dumps(Pipeline(stages=[]))
## b'\x80\x03cpyspark.ml.pipeline\nPipeline\nq ...
What you cannot pickle are Spark Transformers and Estimators, which are only thin wrappers around JVM objects. If you really need this, you can wrap the construction in a function, for example:
from pyspark.ml.feature import Tokenizer

def make_pipeline():
    return Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

pickle.dumps(make_pipeline)
## b'\x80\x03c__ ...
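Note that pickle stores a top-level function by reference (its qualified name), so make_pipeline must be importable when you load it back; calling the restored function then rebuilds a fresh pipeline:

payload = pickle.dumps(make_pipeline)
rebuilt = pickle.loads(payload)()  # calling the restored function recreates the pipeline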
But since it is just a piece of code and doesn't store any persistent data, it doesn't look particularly useful.
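As a side note, if you are on Spark 2.0 or later, pyspark.ml ships its own persistence API, which is the supported way to save and restore a Pipeline. A minimal sketch, assuming Spark >= 2.0 and an active SparkSession ("ml-pipeline" is just a placeholder path):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer

pipeline = Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])
pipeline.write().overwrite().save("ml-pipeline")  # persists the pipeline definition and params
same_pipeline = Pipeline.load("ml-pipeline")      # restores an equivalent Pipeline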