Require kryo serialization in Spark (Scala)

Tags:

kryo

I have kryo serialization turned on with this:

conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )

I want to ensure that a custom class is serialized using kryo when shuffled between nodes. I can register the class with kryo this way:

conf.registerKryoClasses(Array(classOf[Foo]))

As I understand it, this does not actually guarantee that Kryo serialization is used; if a serializer is not available, Kryo will fall back to Java serialization.

To guarantee that kryo serialization happens, I followed this recommendation from the Spark documentation:

conf.set("spark.kryo.registrationRequired", "true")

But this causes an IllegalArgumentException to be thrown ("Class is not registered") for a bunch of different classes which I assume Spark uses internally, for example the following:

org.apache.spark.util.collection.CompactBuffer
scala.Tuple3

Surely I do not have to manually register each of these individual classes with kryo? These serializers are all defined in kryo, so is there a way to automatically register all of them?

asked Jul 13 '15 21:07

1 Answer

"As I understand it, this does not actually guarantee that Kryo serialization is used; if a serializer is not available, Kryo will fall back to Java serialization."

No. If you set spark.serializer to org.apache.spark.serializer.KryoSerializer then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback.

So what is this Kryo registration then?

When Kryo serializes an instance of an unregistered class it has to output the fully qualified class name. That's a lot of characters. Instead, if a class has been pre-registered, Kryo can just output a numeric reference to this class, which is just 1-2 bytes.
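To put rough numbers on that difference, here is a back-of-the-envelope sketch in plain Scala. It assumes about one byte per character of the class name (ignoring any framing bytes Kryo adds), takes the upper end of the "1-2 bytes" figure for a registered class, and uses a hypothetical dataset of one billion records. The object and method names are made up for illustration.

```scala
// Rough estimate only: compares the per-record cost of writing a class name
// with the cost of writing a small numeric class ID.
object ClassNameOverhead {
  // Fully qualified name an unregistered class would be identified by.
  val className: String = classOf[scala.Tuple3[_, _, _]].getName // "scala.Tuple3"

  // Assumption: roughly one byte per character of the name.
  val nameBytes: Long = className.getBytes("UTF-8").length.toLong

  // Assumption: upper end of the "1-2 bytes" quoted for a registered class.
  val registeredIdBytes: Long = 2L

  // Hypothetical record count.
  val rows: Long = 1000000000L

  def totalOverhead(bytesPerRow: Long): Long = bytesPerRow * rows

  def main(args: Array[String]): Unit = {
    println(s"Unregistered: ~${totalOverhead(nameBytes)} bytes of class names")
    println(s"Registered:   ~${totalOverhead(registeredIdBytes)} bytes of class IDs")
  }
}
```

Even for a short name like scala.Tuple3 the gap is several times over, and it grows with longer package names.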

This is especially crucial when each row of an RDD is serialized with Kryo. You don't want to include the same class name for each of a billion rows. So you pre-register these classes. But it's easy to forget to register a new class and then you're wasting bytes again. The solution is to require every class to be registered:

conf.set("spark.kryo.registrationRequired", "true")

Now Kryo will never output full class names. If it encounters an unregistered class, that's a runtime error.

Unfortunately it's hard to enumerate all the classes that you are going to be serializing in advance. The idea is that Spark registers the Spark-specific classes, and you register everything else. You have an RDD[(X, Y, Z)]? You have to register classOf[scala.Tuple3[_, _, _]].
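As a minimal sketch of that division of labour, the configuration below registers a tuple class alongside an application class. Foo is a hypothetical stand-in for your own type, and the app name is made up; the config keys and registerKryoClasses are the ones already used in the question.

```scala
import org.apache.spark.SparkConf

// Hypothetical application class that ends up inside an RDD[(X, Y, Z)].
case class Foo(id: Int, label: String)

object RegistrationSketch {
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-registration-sketch") // hypothetical name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail at runtime instead of silently writing full class names.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[Foo],
        // Needed as soon as 3-tuples are serialized.
        classOf[scala.Tuple3[_, _, _]]
      ))
}
```

With registrationRequired on, each "Class is not registered" error names the next class to add to this list, so in practice the list tends to grow by running the job and fixing what it reports.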

The list of classes that Spark registers actually includes CompactBuffer, so if you see an error for that, you're doing something wrong. You are bypassing the Spark registration procedure. You have to use either spark.kryo.classesToRegister or spark.kryo.registrator to register your classes. (See the config options. If you use GraphX, your registrator should call GraphXUtils.registerKryoClasses.)
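For the spark.kryo.registrator route, a sketch might look like the following. Foo and FooRegistrator are hypothetical names; KryoRegistrator and its registerClasses method are Spark's own. Treat this as an illustration under those assumptions rather than a drop-in solution.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class.
case class Foo(id: Int, label: String)

// Hypothetical registrator. Spark creates it by class name,
// so it needs a no-argument constructor.
class FooRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Foo])
    kryo.register(classOf[scala.Tuple3[_, _, _]])
  }
}

object RegistratorSketch {
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-registrator-sketch") // hypothetical name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // Spark runs this registrator in addition to its own registrations.
      .set("spark.kryo.registrator", classOf[FooRegistrator].getName)
}
```

The spark.kryo.classesToRegister option is the lighter alternative: it takes a comma-separated list of fully qualified class names instead of a registrator class.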

answered Sep 28 '22 08:09

Daniel Darabos

Tags:

apache-spark

kryo

pheaver
