
Does Spark write intermediate shuffle outputs to disk?

I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:

Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.

As I understand it, there are different persistence levels; for example, the default MEMORY_ONLY means the intermediate result is never persisted to disk.

When and why will a shuffle persist something on disk? How can that be reused by further computations?

VB_ asked Dec 03 '16 16:12


1 Answer

When

It happens the first time an operation that requires a shuffle is evaluated (i.e. when an action triggers it), and it cannot be disabled.

Why

This is an optimization. Shuffling is one of the most expensive operations in Spark, so once its output has been written to disk there is no reason to compute it again.

How can that be reused by further computations?

The shuffle files are automatically reused by any subsequent action executed on the same RDD, so the stages before the shuffle are skipped (they show up as "skipped stages" in the Spark UI).
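As a rough mental model (not Spark's actual code or API), the reuse works like this: the first action that needs a shuffle materializes the map-side output on disk, keyed by the shuffle's ID, and any later action depending on the same shuffle finds those files and skips the map stage. The `ShuffleStore` class and the `map_stage_runs` counter below are hypothetical illustrations of that bookkeeping:

```python
# Toy model of Spark's shuffle-output reuse (illustrative only, not Spark's API).
# The first action that needs a shuffle writes its map output "to disk";
# later actions that depend on the same shuffle reuse those files.

class ShuffleStore:
    def __init__(self):
        self.files = {}          # shuffle_id -> materialized map output
        self.map_stage_runs = 0  # counts how often the expensive stage ran

    def get_or_compute(self, shuffle_id, compute_map_output):
        if shuffle_id not in self.files:       # first action: run the map stage
            self.map_stage_runs += 1
            self.files[shuffle_id] = compute_map_output()
        return self.files[shuffle_id]          # later actions: reuse the files

store = ShuffleStore()
data = [("a", 1), ("b", 2), ("a", 3)]

def map_output():
    # stand-in for the shuffle-write phase of e.g. reduceByKey
    out = {}
    for k, v in data:
        out.setdefault(k, []).append(v)
    return out

# Two separate "actions" on the same shuffled RDD:
first = {k: sum(v) for k, v in store.get_or_compute(0, map_output).items()}
second = {k: sum(v) for k, v in store.get_or_compute(0, map_output).items()}

print(first)                 # {'a': 4, 'b': 2}
print(store.map_stage_runs)  # 1 -- the map stage ran only once
```

In real Spark the analogue of `get_or_compute` is the scheduler consulting the MapOutputTracker: if the shuffle's map output is already registered, the parent stages are marked as skipped instead of being resubmitted.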

user6022341 answered Sep 24 '22 03:09