I read that the Kryo serializer can provide faster serialization when used in Apache Spark. However, I'm using Spark through Python.
Do I still get notable benefits from switching to the Kryo serializer?
Kryo won’t make a major impact on PySpark, because PySpark just stores data as byte[] objects, which are fast to serialize even with the default Java serializer.
But it may be worth a try: you would just set the spark.serializer configuration and try not to register any classes.
What might make more impact is storing your data as MEMORY_ONLY_SER and enabling spark.rdd.compress, which will compress your data.
In Java this can add some CPU overhead, but Python runs quite a bit slower, so the extra cost might not matter. It might also speed up computation by reducing GC or letting you cache more data.
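As a rough sketch of the settings mentioned above, something like the following could be tried. The names `SETTINGS`, `make_context`, `cache_serialized`, and the app name `kryo-test` are made up for illustration; the configuration keys and the Kryo class name are the standard Spark ones. Note that newer PySpark releases may not expose `MEMORY_ONLY_SER`, since Python-side data is always stored serialized, so the sketch falls back to `MEMORY_ONLY` in that case.

```python
# Settings discussed in the answer, collected as plain key/value pairs.
SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.rdd.compress": "true",
}


def make_context(app_name="kryo-test"):
    """Build a SparkContext with the settings above (hypothetical helper)."""
    # Imported lazily so SETTINGS can be inspected without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    for key, value in SETTINGS.items():
        conf.set(key, value)
    return SparkContext(conf=conf)


def cache_serialized(rdd):
    """Persist an RDD in serialized, memory-only form (hypothetical helper)."""
    from pyspark import StorageLevel

    # Use MEMORY_ONLY_SER where this PySpark version defines it,
    # otherwise fall back to MEMORY_ONLY.
    level = getattr(StorageLevel, "MEMORY_ONLY_SER", StorageLevel.MEMORY_ONLY)
    return rdd.persist(level)
```

With PySpark installed, usage would be along the lines of `sc = make_context()` followed by `cache_serialized(sc.parallelize(range(1000)))`. Whether this helps depends on the workload, so it is worth timing a job with and without these settings.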
Reference: Matei Zaharia's answer on the mailing list.