I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried both Python's native pickle library and PySpark's PickleSerializer; in both cases the dumps() call itself fails.
Here is the code snippet using the native pickle library:
import pickle
from pyspark.ml import Pipeline

# tokenizer, hashingTF and lr are pipeline stages defined earlier
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
with open('myfile', 'wb') as f:
    pickle.dump(pipeline, f, 2)
with open('myfile', 'rb') as f:
    pipeline1 = pickle.load(f)
Running this produces the error below:
py4j.protocol.Py4JError: An error occurred while calling o32.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)
Is it possible to serialize PySpark Pipeline objects?
Technically speaking, you can easily pickle a Pipeline object itself:
from pyspark.ml.pipeline import Pipeline
import pickle
pickle.dumps(Pipeline(stages=[]))
## b'\x80\x03cpyspark.ml.pipeline\nPipeline\nq ...
What you cannot pickle are Spark Transformers and Estimators, which are only thin wrappers around JVM objects. If you really need this, you can wrap the construction in a function, for example:
from pyspark.ml.feature import Tokenizer

def make_pipeline():
    return Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

pickle.dumps(make_pipeline)
## b'\x80\x03c__ ...
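Note that pickle stores a top-level function by reference (its qualified name), so make_pipeline must be importable when you load it back; calling the restored function then rebuilds a fresh pipeline:

payload = pickle.dumps(make_pipeline)
rebuilt = pickle.loads(payload)()  # calling the restored function recreates the pipeline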
But since it is just a piece of code and doesn't store any persistent data, it doesn't look particularly useful.
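As a side note, if you are on Spark 2.0 or later, pyspark.ml ships its own persistence API, which is the supported way to save and restore a Pipeline. A minimal sketch, assuming Spark >= 2.0 and an active SparkSession ("ml-pipeline" is just a placeholder path):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer

pipeline = Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])
pipeline.write().overwrite().save("ml-pipeline")  # persists the pipeline definition and params
same_pipeline = Pipeline.load("ml-pipeline")      # restores an equivalent Pipeline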