Kryo Serialization for Spark 2.x Dataset

Is Kryo serialization still required when working with the Dataset API?

Because Datasets use Encoders for serialization and deserialization:

  1. Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark, and classes are properly registered.)
  2. If it works, how much performance improvement would it provide? Thanks.
asked Jun 24 '17 by Yasin


People also ask

What is KRYO serialization in spark?

KryoSerializer") . This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application.

Why is KRYO serialization faster in spark?

If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. The join operations and the grouping operations are where serialization has an impact on and they usually have data shuffling. Now lesser the amount of data to be shuffled, the faster will be the operation.
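
One concrete lever: registering the classes that flow through joins and shuffles, so Kryo writes a small numeric ID instead of the fully qualified class name with every record. A sketch, with hypothetical application classes:

import org.apache.spark.SparkConf

case class MyKey(id: Long)                       // hypothetical
case class MyRecord(key: MyKey, payload: String) // hypothetical

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyKey], classOf[MyRecord]))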

Can we use KRYO serializer in Pyspark?

Spark can also use the Kryo library (version 4) to serialize objects more quickly.

Why is KRYO serialized?

Kryo is a fast and efficient binary object graph serialization framework for Java. The goals of the project are high speed, low size, and an easy to use API. The project is useful any time objects need to be persisted, whether to a file, database, or over the network.
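
A rough sketch of the standalone Kryo API, outside Spark (class and file names are illustrative; Kryo's default strategy needs a no-arg constructor, which is why the class below defines one):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.{FileInputStream, FileOutputStream}

class Point(var x: Double, var y: Double) {
  def this() = this(0.0, 0.0) // no-arg constructor for Kryo's default instantiator
}

val kryo = new Kryo()
kryo.register(classOf[Point]) // registration keeps the binary output small

// Persist an object graph to a file...
val out = new Output(new FileOutputStream("point.bin"))
kryo.writeObject(out, new Point(1.0, 2.0))
out.close()

// ...and read it back.
val in = new Input(new FileInputStream("point.bin"))
val restored = kryo.readObject(in, classOf[Point])
in.close()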


1 Answer

You don't need to use Kryo for a dataset if you have an Encoder in scope that can serialize the dataset's type (like an ExpressionEncoder or RowEncoder). Those can do field-level serialization so you can do things like filter on a column within the dataset without unpacking the whole object. Encoders have other optimizations like "runtime code generation to build custom bytecode for serialization and deserialization," and can be many times faster than Kryo.
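
For example, a case class picks up an ExpressionEncoder automatically through spark.implicits, and a column filter runs against Spark's internal binary rows without materializing whole objects (the names here are illustrative):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int) // illustrative

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // brings encoders for case classes and primitives into scope

val people = Seq(Person("ann", 34), Person("bob", 19)).toDS()
people.filter($"age" > 21).show() // reads only the age field, no full deserialization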

However, if you try to put a type in a Dataset and Spark can't find an Encoder for it, you'll get an error either at compile time or at runtime (the latter if the unencodable type is nested inside a case class, for example). Say you wanted to use the DoubleRBTreeSet from the fastutil library. In that situation you'd need to define an Encoder for it, and a quick fix is often to use Kryo:

import org.apache.spark.sql.{Encoder, Encoders}
import it.unimi.dsi.fastutil.doubles.DoubleRBTreeSet

implicit val rbTreeEncoder: Encoder[DoubleRBTreeSet] = Encoders.kryo[DoubleRBTreeSet]
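
With that implicit in scope the type can go in a Dataset, though note the tradeoff: a Kryo encoder stores the whole object as one opaque binary column, so the field-level optimizations described above no longer apply. A sketch, assuming spark is an active SparkSession:

val sets = spark.createDataset(Seq(new DoubleRBTreeSet()))(rbTreeEncoder)
sets.printSchema() // single column: value: binary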
answered Sep 20 '22 by Matt