Why does Spark store intermediate data on disk during a shuffle? I am trying to understand why it cannot keep that data in memory. What are the challenges of writing shuffle output to memory?
Is any work being done to keep it in memory?
Operations that can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join. In other words, just about every operation that someone new to Spark wants to do; a few of them are shown in the sketch below.
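As a minimal sketch (the data and object name here are made up for illustration), each of these calls introduces a shuffle, and therefore a stage boundary, in the job:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-examples")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // reduceByKey must bring all records with the same key together,
    // so it triggers a shuffle.
    val sums = pairs.reduceByKey(_ + _)

    // repartition redistributes records across a new number of
    // partitions, which also shuffles.
    val repartitioned = sums.repartition(8)

    // join co-locates matching keys from both RDDs: another shuffle.
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    val joined = repartitioned.join(other)

    joined.collect().foreach(println)
    spark.stop()
  }
}
```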
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory at the time it is spilled, while shuffle spill (disk) is the size of the serialized form of the same data once written to disk. The memory figure is usually the larger of the two, because the deserialized form is less compact.
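Both numbers surface in the Spark UI, and they can also be read programmatically from task metrics. A small sketch, assuming a Scala Spark application (the listener class name is hypothetical, but the metrics fields are the ones Spark reports):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener for illustration: logs both spill metrics
// reported by each finished task.
class SpillLoggingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
      println(
        s"task ${taskEnd.taskInfo.taskId}: " +
        s"spill (memory) = ${m.memoryBytesSpilled} B, " +
        s"spill (disk) = ${m.diskBytesSpilled} B")
    }
  }
}

// Register it on an existing SparkContext:
// sc.addSparkListener(new SpillLoggingListener)
```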
Spark stores intermediate shuffle output on disk as part of an "under-the-hood" optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage of the graph if the output of an RDD is already on disk as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted.
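A sketch of this effect, again with made-up data: run two actions over the same shuffled RDD without caching it, and the second job reuses the shuffle files written by the first (visible as a "skipped" stage in the Spark UI):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleReuse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-reuse")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(1 to 1000000)
      .map(i => (i % 100, 1))
      .reduceByKey(_ + _)   // stage boundary: shuffle output written to local disk

    // First action: runs the map stage, writes shuffle files, then reduces.
    println(counts.count())

    // Second action on the same (uncached) RDD: the map stage is skipped
    // because its shuffle output already exists on disk, so the lineage
    // is effectively truncated at the shuffle.
    println(counts.filter(_._2 > 0).count())

    spark.stop()
  }
}
```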
The source of this answer is the O'Reilly book Learning Spark by Karau, Konwinski, Wendell & Zaharia. Chapter 8: Tuning and Debugging Spark. Section: Components of Execution: Jobs, Tasks, and Stages.