I am already compressing RDDs with conf.set("spark.rdd.compress", "true") and persisting them with persist(MEMORY_AND_DISK_SER). Will using Kryo serialization make the program even more efficient, or is it not useful in this case? I know that Kryo is for sending data between nodes more efficiently. But if the communicated data is already compressed, is it even needed?
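For reference, that setup looks roughly like this (a minimal sketch, assuming Scala and a plain SparkContext; the app name and input path are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("compressed-rdds")
conf.set("spark.rdd.compress", "true") // compress serialized RDD partitions
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs://...") // illustrative input path
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // store partitions serialized, spilling to disk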
Kryo is a fast and efficient binary object graph serialization framework for Java. The goals of the project are high speed, low size, and an easy-to-use API. The project is useful any time objects need to be persisted, whether to a file or database, or sent over the network.
Serialization matters for performance tuning in Apache Spark: all data that is sent over the network, written to disk, or persisted in memory must be serialized, so serialization cost shows up directly in expensive operations such as shuffles, spills, and caching.
Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but it does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance. It is not the default because not every java.io.Serializable class is supported out of the box, and registration requires extra setup.
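As a sketch, registration looks like this (the Purchase class here is hypothetical; registerKryoClasses is the standard SparkConf method for this):

import org.apache.spark.SparkConf

case class Purchase(id: Long, amount: Double) // hypothetical application class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Registration lets Kryo write a compact numeric ID instead of the full class name
conf.registerKryoClasses(Array(classOf[Purchase]))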
Kryo won't make a major impact on PySpark, because PySpark just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try: set the spark.serializer configuration and try not registering any classes.
Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that serialization is also used for shuffles (sending data between nodes): any time data needs to leave a JVM, whether it's going to local disk or through the network, it needs to be serialized.
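To make that concrete, here is a hedged sketch marking where serialization kicks in (variable names are illustrative; sc is the SparkContext from the snippet above):

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))

// Persist: each partition is serialized (and, with spark.rdd.compress=true, also compressed)
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Shuffle: reduceByKey moves records between executors, so they are serialized in transit
val sums = pairs.reduceByKey(_ + _)
sums.count()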
Kryo is a significantly optimized serializer, and it performs better than the standard Java serializer for just about everything. In your case, you may actually be using Kryo already. You can check your Spark configuration parameter:
"spark.serializer" should be "org.apache.spark.serializer.KryoSerializer".
If it's not, then you can set it in your configuration with:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Regarding your last question ("is it even needed?"): it's hard to make a general claim. Kryo optimizes one of the slow steps in communicating data, but it's entirely possible that in your use case other steps are the bottleneck. There's no downside to trying Kryo and benchmarking the difference!
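One minimal way to benchmark, assuming you can run the same job once under each serializer setting (the timing helper and rdd reference are illustrative):

// Illustrative timing helper; rerun the whole job per serializer setting
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time("count with persisted RDD") {
  rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
  rdd.count() // materializes the persist, so serialization cost is included
}

The Storage tab of the Spark web UI also shows the in-memory and on-disk size of each persisted RDD, which makes the size difference between the two serializers easy to compare.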