 

When to use Kryo serialization in Spark?

I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using Kryo serialization make the program even more efficient, or is it not useful in this case? I know that Kryo is for sending the data between the nodes in a more efficient way. But if the communicated data is already compressed, is it even needed?

asked Oct 26 '16 by pythonic

People also ask

What is Kryo serialization?

Kryo is a fast and efficient binary object graph serialization framework for Java. The goals of the project are high speed, low size, and an easy to use API. The project is useful any time objects need to be persisted, whether to a file, database, or over the network.

Why do we need serialization in Spark?

Serialization is used for performance tuning on Apache Spark. All data that is sent over the network or written to the disk or persisted in the memory should be serialized. Serialization plays an important role in costly operations.

Why is Kryo serialization faster than Java serialization?

Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but it does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance. It is not used by default because not every java.io.Serializable type is supported out of the box.
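The class registration mentioned above is done through Spark configuration. A minimal sketch in PySpark (the class name `com.example.MyRecord` is a hypothetical placeholder; registration only matters for JVM-side objects):

```python
from pyspark import SparkConf

conf = (SparkConf()
        # Switch from the default Java serializer to Kryo:
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Comma-separated, fully qualified names of JVM classes to register:
        .set("spark.kryo.classesToRegister", "com.example.MyRecord")
        # Optional: fail fast on unregistered classes instead of silently
        # writing full class names with each object (useful while tuning):
        .set("spark.kryo.registrationRequired", "true"))
```

Registering classes lets Kryo write a small numeric ID instead of the full class name with every serialized object, which is where much of the size saving comes from.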

Can we use the Kryo serializer in PySpark?

Kryo won't make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try — you would just set the spark.serializer configuration and try not registering any classes.


1 Answer

Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that serialization is also used for shuffles (sending data between nodes): any time data needs to leave a JVM, whether it's going to local disk or through the network, it needs to be serialized.

Kryo is a significantly optimized serializer, and performs better than the standard Java serializer for just about everything. In your case, you may actually be using Kryo already. You can check your Spark configuration parameter:

"spark.serializer" should be "org.apache.spark.serializer.KryoSerializer".

If it's not, then you can set this internally with:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Regarding your last question ("is it even needed?"), it's hard to make a general claim about that. Kryo optimizes one of the slow steps in communicating data, but it's entirely possible that in your use case, others are holding you back. But there's no downside to trying Kryo and benchmarking the difference!
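A quick way to do that benchmarking is to run the same persisted-and-shuffled workload once per serializer and compare timings. A sketch assuming a local PySpark installation (the app name, data size, and workload are illustrative placeholders; results are entirely workload-dependent):

```python
import time

from pyspark import SparkConf, SparkContext
from pyspark.storagelevel import StorageLevel


def time_with(serializer):
    """Time a shuffle action on a persisted RDD under the given serializer."""
    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("kryo-benchmark")
            .set("spark.serializer", serializer)
            .set("spark.rdd.compress", "true"))  # same setting as the question
    sc = SparkContext(conf=conf)
    try:
        rdd = sc.parallelize(range(500_000)).map(lambda x: (x % 100, x))
        # Note: in PySpark, persisted data is always pickled, so MEMORY_AND_DISK
        # behaves like the JVM's serialized storage levels.
        rdd.persist(StorageLevel.MEMORY_AND_DISK)
        rdd.count()  # force materialization of the persisted RDD
        start = time.monotonic()
        rdd.reduceByKey(lambda a, b: a + b).count()  # shuffle => serialization
        return time.monotonic() - start
    finally:
        sc.stop()


java_t = time_with("org.apache.spark.serializer.JavaSerializer")
kryo_t = time_with("org.apache.spark.serializer.KryoSerializer")
print(f"Java serializer: {java_t:.2f}s   Kryo serializer: {kryo_t:.2f}s")
```

As the "People also ask" note above points out, the difference will be small in PySpark, where row data crosses the shuffle as pickled byte arrays; the same pattern in a Scala/Java job is where Kryo typically pays off.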

answered Oct 11 '22 by Tim