I wanted to write the output file in Parquet format. To do that, I converted the RDD to a Dataset, since an RDD cannot be written out as Parquet directly. Creating the Dataset requires an implicit encoder; otherwise the code fails with a compile-time error. I have a few questions about this. Here is my code:
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[ItemData]
val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)
ds.write
.mode(SaveMode.Overwrite)
.parquet(configuration.outputPath)
Following are my questions:
So basically, it is showing me an object.toString()-like value.
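To illustrate that symptom, here is a sketch (reusing sparkSession and configuration.outputPath from the code above, not verified against your data) of reading the Kryo-written Parquet back. With Encoders.kryo, the whole object is serialized into a single binary column:

```scala
// Read back the Parquet file written with the Kryo encoder.
val df = sparkSession.read.parquet(configuration.outputPath)

df.printSchema() // shows a single column, "value", of binary type
df.show()        // each row displays as an opaque byte array,
                 // i.e. an object.toString()-like value
```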
From the documentation:

createDataset requires an encoder to convert a JVM object of type T to and from the internal Spark SQL representation.
From Heather Miller's course:
Basically, encoders are what convert your data between JVM objects and Spark SQL's specialized internal (tabular) representation. They're required by all Datasets!
Encoders are highly specialized and optimized code generators that generate custom bytecode for serialization and deserialization of your data.
I believe it is now clear what encoders are and what they do. Regarding your second question: the Kryo serializer leads to Spark storing every row in the Dataset as a flat binary object. Instead of the Java or Kryo serializers, you can use Spark's internal encoders, which are picked up automatically via spark.implicits._. They also use less memory than Kryo/Java serialization.
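A minimal sketch of that approach, assuming ItemData is (or can be modeled as) a case class so it gets a built-in product encoder; filteredRDD and configuration.outputPath are taken from the question's code:

```scala
import org.apache.spark.sql.{Dataset, SaveMode}

import sparkSession.implicits._ // brings Spark's internal encoders into scope

// No explicit Kryo encoder needed: the implicit product encoder for the
// ItemData case class is used, so the Parquet file keeps one column per
// field instead of a single binary "value" blob.
val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)

ds.write
  .mode(SaveMode.Overwrite)
  .parquet(configuration.outputPath)
```

Note that this only works when ItemData is a type Spark has a built-in encoder for (primitives, case classes, and products of Spark SQL data types); otherwise the implicit resolution fails at compile time.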
UPDATE I
Based on your comment, here are the things that set Spark encoders apart from regular Java and Kryo serialization (from Heather Miller's course):
- Limited to and optimal for primitives, case classes, and Spark SQL data types.
- They contain schema information, which makes these highly optimized code generators possible and enables optimization based on the shape of the data. Since Spark understands the structure of the data in Datasets, it can create a more optimal layout in memory when caching Datasets.
- More than 10x faster than Kryo serialization (Java serialization is orders of magnitude slower).
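As a rough illustration of the schema point, you can compare the two encoders on the same case class in a spark-shell (the case class and field names here are made up for the example):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Illustrative case class, not from the original question.
case class Item(id: Long, name: String)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("encoder-demo")
  .getOrCreate()
import spark.implicits._

// Built-in product encoder: Spark knows the full structure.
val typed = spark.createDataset(Seq(Item(1L, "a")))
typed.printSchema() // columns id and name, with their real types

// Kryo encoder: the object becomes one opaque binary column,
// so Spark cannot optimize based on the shape of the data.
val kryoed = spark.createDataset(Seq(Item(1L, "a")))(Encoders.kryo[Item])
kryoed.printSchema() // a single binary "value" column
```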
I hope it helps!