
Why does Spark shuffle store intermediate data on disk?

Why does Spark store intermediate data on disk during a shuffle? I am trying to understand why it cannot keep that data in memory. What are the challenges of writing it to memory instead?

Is any work being done to write shuffle data to memory?

asked Dec 04 '14 21:12 by Venkat Ankam


People also ask

What causes a shuffle operation in Spark?

Operations that can cause a shuffle include repartition operations such as repartition and coalesce, 'ByKey operations (except for counting) such as groupByKey and reduceByKey, and join operations such as cogroup and join. In other words, most of the operations that people new to Spark reach for first.

What is shuffle memory in Spark?

Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory. Shuffle spill (disk) is the size of the serialized form of the data on disk.


1 Answer

Spark writes the map-side output of a shuffle to local disk and keeps it there as part of its "under-the-hood" optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage if the data it needs already exists as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted.

The source of this answer is the O'Reilly book Learning Spark by Karau, Konwinski, Wendell & Zaharia. Chapter 8: Tuning and Debugging Spark. Section: Components of Execution: Jobs, Tasks, and Stages.
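To make the lineage truncation above concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's code: `ToyShuffle`, `count_by_key`, `upstream` and the `part-N.pkl` file layout are all names I made up for illustration. The only point it shows is that once the map side has written its output to disk, a later run can read those files instead of recomputing the upstream data.

```python
import os
import pickle
import shutil
import tempfile
from collections import defaultdict


class ToyShuffle:
    """Toy model of a shuffle whose map output lives on disk (not Spark code)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.shuffle_dir = tempfile.mkdtemp(prefix="toy_shuffle_")
        self.map_runs = 0  # times the upstream records were actually computed

    def _path(self, partition):
        # Hypothetical file layout: one pickle file per reduce partition.
        return os.path.join(self.shuffle_dir, f"part-{partition}.pkl")

    def _outputs_exist(self):
        return all(os.path.exists(self._path(p))
                   for p in range(self.num_partitions))

    def map_side(self, compute_records):
        """Compute the upstream records and write one file per partition,
        unless the files from an earlier run are still on disk."""
        if self._outputs_exist():
            return  # reuse the existing shuffle files, skip the recomputation
        self.map_runs += 1
        buckets = defaultdict(list)
        for key, value in compute_records():
            buckets[hash(key) % self.num_partitions].append((key, value))
        for p in range(self.num_partitions):
            with open(self._path(p), "wb") as f:
                pickle.dump(buckets[p], f)

    def reduce_side(self, partition):
        """Read one partition's file back and sum the values per key."""
        with open(self._path(partition), "rb") as f:
            records = pickle.load(f)
        totals = defaultdict(int)
        for key, value in records:
            totals[key] += value
        return dict(totals)

    def cleanup(self):
        shutil.rmtree(self.shuffle_dir, ignore_errors=True)


def count_by_key(shuffle, compute_records):
    """Run the map side (if needed), then merge every reduce partition."""
    shuffle.map_side(compute_records)
    result = {}
    for p in range(shuffle.num_partitions):
        result.update(shuffle.reduce_side(p))
    return result


def upstream():
    # Stand-in for an expensive parent computation.
    return [("a", 1), ("b", 1), ("a", 1), ("c", 1)]


shuffle = ToyShuffle(num_partitions=2)
first = count_by_key(shuffle, upstream)
second = count_by_key(shuffle, upstream)  # served from the files on disk
print(first == second, shuffle.map_runs)
shuffle.cleanup()
```

The second call returns the same counts while `map_runs` stays at 1, which is the toy equivalent of Spark skipping a stage whose shuffle output is already available.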

answered Oct 18 '22 01:10 by rainman