I have kryo serialization turned on with this:
conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )
I want to ensure that a custom class is serialized using Kryo when shuffled between nodes. I can register the class with Kryo this way:
conf.registerKryoClasses(Array(classOf[Foo]))
As I understand it, this does not actually guarantee that Kryo serialization is used; if a serializer is not available, Kryo will fall back to Java serialization.
To guarantee that kryo serialization happens, I followed this recommendation from the Spark documentation:
conf.set("spark.kryo.registrationRequired", "true")
But this causes an IllegalArgumentException ("Class is not registered") to be thrown for a number of different classes which I assume Spark uses internally, for example the following:

org.apache.spark.util.collection.CompactBuffer
scala.Tuple3
Surely I do not have to manually register each of these individual classes with Kryo? These serializers are all defined in Kryo, so is there a way to automatically register all of them?
"As I understand it, this does not actually guarantee that Kryo serialization is used; if a serializer is not available, Kryo will fall back to Java serialization."
No. If you set spark.serializer to org.apache.spark.serializer.KryoSerializer, then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback.
So what is this Kryo registration then?
When Kryo serializes an instance of an unregistered class it has to output the fully qualified class name. That's a lot of characters. Instead, if a class has been pre-registered, Kryo can just output a numeric reference to this class, which is just 1-2 bytes.
This is especially crucial when each row of an RDD is serialized with Kryo. You don't want to include the same class name for each of a billion rows. So you pre-register these classes. But it's easy to forget to register a new class and then you're wasting bytes again. The solution is to require every class to be registered:
conf.set("spark.kryo.registrationRequired", "true")
Now Kryo will never output full class names. If it encounters an unregistered class, that's a runtime error.
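Putting the pieces together, a minimal sketch of a configuration that both requires registration and registers a custom class might look like this (Foo stands in for the custom class from the question):

```scala
import org.apache.spark.SparkConf

// Illustrative custom class that will cross the wire during shuffles.
case class Foo(id: Long, name: String)

val conf = new SparkConf()
  // Use Kryo instead of Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast at runtime if any serialized class was not registered.
  .set("spark.kryo.registrationRequired", "true")

// Register your own classes so Kryo writes a small numeric ID
// instead of the fully qualified class name for every record.
conf.registerKryoClasses(Array(classOf[Foo]))
```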
Unfortunately it's hard to enumerate in advance all the classes that you are going to be serializing. The idea is that Spark registers the Spark-specific classes, and you register everything else. You have an RDD[(X, Y, Z)]? You have to register classOf[scala.Tuple3[_, _, _]].
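For example, a sketch of the registrations needed for an RDD of triples (generics are erased at runtime, so wildcard type parameters are enough; Foo is again an illustrative element type):

```scala
// Register the tuple class itself plus the element types it contains.
conf.registerKryoClasses(Array(
  classOf[scala.Tuple3[_, _, _]], // for RDD[(X, Y, Z)]
  classOf[Foo]                    // plus your own element types
))
```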
The list of classes that Spark registers actually includes CompactBuffer, so if you see an error for that, you're doing something wrong: you are bypassing the Spark registration procedure. You have to use either spark.kryo.classesToRegister or spark.kryo.registrator to register your classes. (See the config options. If you use GraphX, your registrator should call GraphXUtils.registerKryoClasses.)