I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:
Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
As I understand it, there are different persistence levels, for example the default MEMORY_ONLY, which means an intermediate result is never persisted to disk.
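For example, the kind of explicit persistence I have in mind (a minimal sketch, assuming a SparkContext `sc` as in spark-shell; the input path is made up):

```scala
import org.apache.spark.storage.StorageLevel

// Explicit persistence with MEMORY_ONLY: partitions are kept in memory
// only; anything that doesn't fit is recomputed from lineage rather
// than spilled to disk.
val words = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

words.persist(StorageLevel.MEMORY_ONLY)         // equivalent to words.cache()
```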
When and why will a shuffle persist something on disk? How can that be reused by further computations?
It happens the first time an operation that requires a shuffle is evaluated (that is, when an action runs on it), and it cannot be disabled.

This is an optimization: shuffling is one of the most expensive operations in Spark, and its map-side output is always written to the executors' local disks as part of the shuffle itself.

That shuffle output is then automatically reused by any subsequent action executed on the same RDD; see the sketch below.
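A minimal sketch of the reuse (again assuming a SparkContext `sc` as in spark-shell; the input path is made up):

```scala
// reduceByKey requires a shuffle: its map-side output is written to
// the local disks of the executors.
val counts = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.count()    // first action: both stages run, shuffle files hit disk

// Second action on the same RDD: Spark reads the existing shuffle files
// instead of recomputing the map stage, even though counts was never
// explicitly persisted.
counts.collect()
```

The easiest way to observe this is the Spark UI: in the second job, the map stage appears under "Skipped Stages".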