The Spark programming guide mentions serializing RDDs as one of the techniques to decrease memory usage. As I understand it, serialization is the conversion of an object to bytes so that the object can easily be saved to storage. So how does it occupy less space?
As of Spark 2.x, the memory tuning document explains that Java objects carry overhead on top of the raw data, such as a pointer to the object's class, and that collections of primitive types often store their elements as boxed wrapper objects. This overhead is not stored when objects are serialized.
The trade-off is that the data is stored as a serialized byte array in each partition, so it has to be deserialized before it can be used, which may be time-consuming.
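The effect can be illustrated outside Spark with a rough analogy. The sketch below is plain Python, not Spark's serializer or the JVM: it compares the memory taken by a list of integer objects with the same integers packed into a raw byte string, and then unpacks them again to show the extra step needed before use. The sample data and the 4-byte encoding are assumptions chosen for illustration, and the exact object sizes depend on the interpreter.

```python
import struct
import sys

# Hypothetical sample data: 1,000 integers standing in for one partition.
values = list(range(100_000, 101_000))

# Object form: the list holds one pointer per element, and every element
# is a full int object with its own header.
object_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Serialized form: each integer packed as a raw 4-byte little-endian value,
# with no per-object header or pointer.
packed = struct.pack(f"<{len(values)}i", *values)
serialized_bytes = len(packed)

# Using the data again means paying to deserialize it first.
restored = list(struct.unpack(f"<{len(values)}i", packed))

print("as objects:   ", object_bytes, "bytes")
print("as raw bytes: ", serialized_bytes, "bytes")
print("round trip ok:", restored == values)
```

In Spark itself the same trade-off is chosen through a serialized storage level in the RDD persistence API, such as `MEMORY_ONLY_SER` in the Scala and Java APIs.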
https://spark.apache.org/docs/latest/tuning.html