Why does Spark store intermediate data on disk during a shuffle? I am trying to understand why it cannot keep that data in memory. What are the challenges of writing shuffle output to memory?
Is any work being done to keep it in memory?
Operations that can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join. In other words, just about every operation that someone new to Spark wants to do; a few of them are shown in the sketch below.
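As a minimal sketch (the data and object name here are made up for illustration), each of these calls introduces a shuffle, and therefore a stage boundary, in the job:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-examples")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // reduceByKey must bring all records with the same key together,
    // so it triggers a shuffle.
    val sums = pairs.reduceByKey(_ + _)

    // repartition redistributes records across a new number of
    // partitions, which also shuffles.
    val repartitioned = sums.repartition(8)

    // join co-locates matching keys from both RDDs: another shuffle.
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    val joined = repartitioned.join(other)

    joined.collect().foreach(println)
    spark.stop()
  }
}
```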
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory at the time it is spilled, while shuffle spill (disk) is the size of the serialized form of the same data once written to disk. The memory figure is usually the larger of the two, because the deserialized form is less compact.
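Both numbers surface in the Spark UI, and they can also be read programmatically from task metrics. A small sketch, assuming a Scala Spark application (the listener class name is hypothetical, but the metrics fields are the ones Spark reports):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener for illustration: logs both spill metrics
// reported by each finished task.
class SpillLoggingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
      println(
        s"task ${taskEnd.taskInfo.taskId}: " +
        s"spill (memory) = ${m.memoryBytesSpilled} B, " +
        s"spill (disk) = ${m.diskBytesSpilled} B")
    }
  }
}

// Register it on an existing SparkContext:
// sc.addSparkListener(new SpillLoggingListener)
```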
Spark stores intermediate shuffle output on disk as part of an "under-the-hood" optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage of the graph if the output of an RDD is already on disk as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted.
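A sketch of this effect, again with made-up data: run two actions over the same shuffled RDD without caching it, and the second job reuses the shuffle files written by the first (visible as a "skipped" stage in the Spark UI):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleReuse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-reuse")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(1 to 1000000)
      .map(i => (i % 100, 1))
      .reduceByKey(_ + _)   // stage boundary: shuffle output written to local disk

    // First action: runs the map stage, writes shuffle files, then reduces.
    println(counts.count())

    // Second action on the same (uncached) RDD: the map stage is skipped
    // because its shuffle output already exists on disk, so the lineage
    // is effectively truncated at the shuffle.
    println(counts.filter(_._2 > 0).count())

    spark.stop()
  }
}
```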
The source of this answer is the O'Reilly book Learning Spark by Karau, Konwinski, Wendell & Zaharia. Chapter 8: Tuning and Debugging Spark. Section: Components of Execution: Jobs, Tasks, and Stages.