Does Kryo help in SparkSQL?

Kryo improves the performance of Spark applications through its efficient serialization approach.
I'm wondering whether Kryo will help in the case of SparkSQL, and how I should use it.
In SparkSQL applications we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static.
I'm not sure how to register one or several serializer classes for this use case.

For example:

case class Info(name: String, address: String)
...
import spark.implicits._  // needed for .toDF and the $"..." column syntax

val df = spark.sparkContext.textFile(args(0))
         .map(_.split(','))
         .filter(_.length >= 2)
         .map(e => Info(e(0), e(1)))
         .toDF()
df.select($"name") ... // followed by subsequent analysis
df.select($"address") ... // followed by subsequent analysis

I don't think it's a good idea to define case classes for each select.
Or does it help if I register Info, like registerKryoClasses(Array(classOf[Info]))?
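
For reference, the registration I have in mind is roughly this (just a sketch; the session setup is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: enable Kryo and register the case class up front.
// Whether this actually matters for DataFrame operations is exactly my question.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Info]))

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()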

asked Mar 14 '18 by user6502167

People also ask

What is the use of KRYO serializer in Spark?

Set spark.serializer to org.apache.spark.serializer.KryoSerializer. This setting configures the serializer used not only for shuffling data between worker nodes but also for serializing RDDs to disk.

Why is KRYO serialization faster in Spark?

If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. Join and grouping operations are where serialization has an impact, and they usually involve data shuffling. The less data there is to shuffle, the faster the operation.

Can we use KRYO serializer in PySpark?

Kryo won't make a major impact on PySpark because PySpark just stores data as byte[] objects, which are fast to serialize even with Java serialization. It may still be worth a try: set the spark.serializer configuration and try not registering any classes.


1 Answer

According to Spark's documentation, SparkSQL does not use Kryo or Java serialization.

Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.
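
To make that concrete, here is a minimal sketch (assuming the Info case class from the question) of the encoder Spark derives for a product type; Encoders.product is the explicit form of what import spark.implicits._ gives you implicitly:

import org.apache.spark.sql.{Encoder, Encoders}

// Sketch: Spark derives a columnar encoder for the case class,
// instead of relying on Java or Kryo serialization.
val infoEncoder: Encoder[Info] = Encoders.product[Info]
println(infoEncoder.schema)  // the columnar schema derived from Info: name and address as string columns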

Encoders are much more lightweight than Java or Kryo serialization, which is to be expected: it is a far more optimizable job to serialize, say, a Row of three longs and two ints than to serialize a class along with its version description and its inner variables, and then having to instantiate it.

That being said, there is a way to use Kryo as an encoder implementation; see for example How to store custom objects in Dataset?. But this is meant as a solution for storing custom objects (e.g. non-product classes) in a Dataset, not something especially targeted at standard DataFrames.
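
As a rough sketch of that approach (NonProduct is a hypothetical class and spark is the existing SparkSession), you declare an implicit Kryo-backed encoder, at the price of losing the columnar layout:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Hypothetical non-product class, which has no built-in encoder.
class NonProduct(val name: String, val address: String)

// Sketch: fall back to a Kryo-based encoder for it.
implicit val nonProductEncoder: Encoder[NonProduct] = Encoders.kryo[NonProduct]

val ds: Dataset[NonProduct] = spark.createDataset(Seq(new NonProduct("a", "b")))
ds.printSchema()  // a single "value: binary" column, so column operations like select($"name") are not available

This is why it is a fit for storing opaque custom objects, rather than for the column-oriented DataFrame workflow of the question.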

Without Kryo or Java serializers, creating encoders for custom, non-product classes is somewhat limited (see the discussions on user-defined types, for example starting here: Does Apache spark 2.2 supports user-defined type (UDT)?).

answered Oct 18 '22 by GPI