Why is an encoder needed for creating a Dataset in Spark

I wanted to write the output file in Parquet format. For that, I converted the RDD to a Dataset, since we cannot write Parquet directly from an RDD. And to create the Dataset, we need to use an implicit encoder; otherwise it gives a compile-time error. I have a few questions in this regard. Following is my code:

    implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[ItemData]

    val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)

    ds.write
      .mode(SaveMode.Overwrite)
      .parquet(configuration.outputPath)

Following are my questions:

  1. Why is it important to use an encoder while creating the Dataset, and what does this encoder do?
  2. From the above code, when I get the output file in Parquet form, I see it in encoded form. How can I decode it? When I decode it using base64, I get the following: com.........processor.spark.ItemDat"0156028263

So, basically, it is showing me an object.toString() kind of value.

asked Jan 27 '23 by hatellla
1 Answer

From the documentation:

createDataset requires an encoder to convert a JVM object of type T to and from the internal Spark SQL representation.
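For reference, the relevant SparkSession method has (simplified) the following shape; the context bound on T is what forces you to supply an encoder:

    // Simplified from org.apache.spark.sql.SparkSession:
    // [T : Encoder] is a context bound, i.e. an implicit Encoder[T]
    // must be in scope at the call site, or compilation fails with
    // "Unable to find encoder for type ...".
    def createDataset[T : Encoder](data: RDD[T]): Dataset[T]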

From Heather Miller's course:

Basically, encoders are what convert your data between JVM objects and Spark SQL's specialized internal (tabular) representation. They're required by all Datasets!

Encoders are highly specialized and optimized code generators that generate custom bytecode for serialization and deserialization of your data.
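You can see that specialized internal representation directly by comparing the schemas two different encoders produce for the same class. A minimal sketch, assuming an illustrative ItemData (your real fields will differ):

    import org.apache.spark.sql.Encoders

    // Illustrative stand-in for your ItemData.
    case class ItemData(id: Long, name: String)

    // A product encoder understands the case class structure:
    Encoders.product[ItemData].schema.printTreeString()
    // root
    //  |-- id: long (nullable = false)
    //  |-- name: string (nullable = true)

    // The Kryo encoder reduces every row to one opaque binary column,
    // which is exactly what you saw in your Parquet output:
    Encoders.kryo[ItemData].schema.printTreeString()
    // root
    //  |-- value: binary (nullable = true)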

I believe it is now clear what encoders are and what they do. Regarding your second question: the Kryo serializer leads Spark to store every row in the Dataset as a single flat binary object, which is why your Parquet output looks encoded. Instead of the Java or Kryo serializers, you can use Spark's internal encoders, which are brought into scope automatically via spark.implicits._. They also use less memory than Kryo/Java serialization.
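As a sketch of the rewrite, assuming ItemData is (or can be turned into) a case class; the field names here are illustrative:

    import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}

    // Illustrative case class; your real ItemData may differ.
    case class ItemData(id: Long, name: String)

    val spark: SparkSession = SparkSession.builder()
      .appName("parquet-writer")
      .getOrCreate()

    // Brings Spark's internal (product) encoders into scope;
    // no explicit Encoders.kryo is needed for case classes.
    import spark.implicits._

    val ds: Dataset[ItemData] = spark.createDataset(filteredRDD)

    ds.write
      .mode(SaveMode.Overwrite)
      .parquet(configuration.outputPath)

Written this way, the Parquet file gets one readable column per field instead of a single binary value column.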

UPDATE I

Based on your comment, here are the things that set Spark encoders apart from regular Java and Kryo serialization (from Heather Miller's course):

  • Limited to, and optimal for, primitives, case classes, and Spark SQL data types (see the sketch after this list).
  • They contain schema information, which makes these highly optimized code generators possible and enables optimization based on the shape of the data. Since Spark understands the structure of the data in a Dataset, it can create a more optimal layout in memory when caching Datasets.
  • >10x faster than Kryo serialization (Java serialization is orders of magnitude slower).
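To illustrate the first point: product encoders are derived automatically for case classes, while an arbitrary class has no implicit encoder and forces you back to Kryo. A minimal sketch, with made-up names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    case class ItemData(id: Long, name: String)   // case class: encoder derived for free
    val ok = Seq(ItemData(1L, "a")).toDS()        // compiles via the product encoder

    class Opaque(val x: Int)                      // plain class: no implicit encoder
    // Seq(new Opaque(1)).toDS()                  // does not compile: "Unable to find encoder..."
    // Such types need an explicit fallback, e.g. Encoders.kryo[Opaque].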

I hope it helps!

answered Feb 01 '23 by ulubeyn