I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:
Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
As I understand it, there are different persistence levels, for example the default MEMORY_ONLY, which means an intermediate result is never persisted to disk.
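For example, the kind of explicit persistence I have in mind (a minimal sketch, assuming a SparkContext `sc` as in spark-shell; the input path is made up):

```scala
import org.apache.spark.storage.StorageLevel

// Explicit persistence with MEMORY_ONLY: partitions are kept in memory
// only; anything that doesn't fit is recomputed from lineage rather
// than spilled to disk.
val words = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

words.persist(StorageLevel.MEMORY_ONLY)         // equivalent to words.cache()
```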
When and why will a shuffle persist something on disk? How can that be reused by further computations?
It happens the first time an operation that requires a shuffle is evaluated (that is, when an action runs on it), and it cannot be disabled.

This is an optimization: shuffling is one of the most expensive operations in Spark, and its map-side output is always written to the executors' local disks as part of the shuffle itself.

That shuffle output is then automatically reused by any subsequent action executed on the same RDD; see the sketch below.
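A minimal sketch of the reuse (again assuming a SparkContext `sc` as in spark-shell; the input path is made up):

```scala
// reduceByKey requires a shuffle: its map-side output is written to
// the local disks of the executors.
val counts = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.count()    // first action: both stages run, shuffle files hit disk

// Second action on the same RDD: Spark reads the existing shuffle files
// instead of recomputing the map stage, even though counts was never
// explicitly persisted.
counts.collect()
```

The easiest way to observe this is the Spark UI: in the second job, the map stage appears under "Skipped Stages".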