Kryo Serialization for Spark 2.x Dataset

Is Kryo serialization still required when working with the Dataset API?

Because Datasets use Encoders for serialization and deserialization:

  1. Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark, and classes are properly registered.)
  2. If it works, how much performance improvement would it provide? Thanks.
asked Jun 24 '17 by Yasin


People also ask

What is KRYO serialization in spark?

KryoSerializer") . This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application.

Why is KRYO serialization faster in spark?

If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. The join operations and the grouping operations are where serialization has an impact on and they usually have data shuffling. Now lesser the amount of data to be shuffled, the faster will be the operation.
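
One concrete lever: registering the classes that flow through joins and shuffles, so Kryo writes a small numeric ID instead of the fully qualified class name with every record. A sketch, with hypothetical application classes:

import org.apache.spark.SparkConf

case class MyKey(id: Long)                       // hypothetical
case class MyRecord(key: MyKey, payload: String) // hypothetical

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyKey], classOf[MyRecord]))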

Can we use KRYO serializer in Pyspark?

Spark can also use the Kryo library (version 4) to serialize objects more quickly.

Why is KRYO serialized?

Kryo is a fast and efficient binary object graph serialization framework for Java. The goals of the project are high speed, low size, and an easy to use API. The project is useful any time objects need to be persisted, whether to a file, database, or over the network.
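
A rough sketch of the standalone Kryo API, outside Spark (class and file names are illustrative; Kryo's default strategy needs a no-arg constructor, which is why the class below defines one):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.{FileInputStream, FileOutputStream}

class Point(var x: Double, var y: Double) {
  def this() = this(0.0, 0.0) // no-arg constructor for Kryo's default instantiator
}

val kryo = new Kryo()
kryo.register(classOf[Point]) // registration keeps the binary output small

// Persist an object graph to a file...
val out = new Output(new FileOutputStream("point.bin"))
kryo.writeObject(out, new Point(1.0, 2.0))
out.close()

// ...and read it back.
val in = new Input(new FileInputStream("point.bin"))
val restored = kryo.readObject(in, classOf[Point])
in.close()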


1 Answer

You don't need to use Kryo for a dataset if you have an Encoder in scope that can serialize the dataset's type (like an ExpressionEncoder or RowEncoder). Those can do field-level serialization so you can do things like filter on a column within the dataset without unpacking the whole object. Encoders have other optimizations like "runtime code generation to build custom bytecode for serialization and deserialization," and can be many times faster than Kryo.
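
For example, a case class picks up an ExpressionEncoder automatically through spark.implicits, and a column filter runs against Spark's internal binary rows without materializing whole objects (the names here are illustrative):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int) // illustrative

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // brings encoders for case classes and primitives into scope

val people = Seq(Person("ann", 34), Person("bob", 19)).toDS()
people.filter($"age" > 21).show() // reads only the age field, no full deserialization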

However, if you try to put a type in a Dataset and Spark can't find an Encoder for it, you'll get an error either at compile time or at runtime (the latter if the unencodable type is nested inside a case class, for example). Say you wanted to use the DoubleRBTreeSet from the fastutil library. In that situation you'd need to define an Encoder for it, and a quick fix is often to use Kryo:

import org.apache.spark.sql.{Encoder, Encoders}
import it.unimi.dsi.fastutil.doubles.DoubleRBTreeSet

implicit val rbTreeEncoder: Encoder[DoubleRBTreeSet] = Encoders.kryo[DoubleRBTreeSet]
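
With that implicit in scope the type can go in a Dataset, though note the tradeoff: a Kryo encoder stores the whole object as one opaque binary column, so the field-level optimizations described above no longer apply. A sketch, assuming spark is an active SparkSession:

val sets = spark.createDataset(Seq(new DoubleRBTreeSet()))(rbTreeEncoder)
sets.printSchema() // single column: value: binary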
answered Sep 20 '22 by Matt