I have the following Spark job, trying to keep everything in memory:
    val myOutRDD = myInRDD.flatMap { fp =>
      val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
      :
      tuple2List
    }.persist(StorageLevel.MEMORY_ONLY)
     .reduceByKey { (p1, p2) =>
       myMergeFunction(p1, p2)
     }.persist(StorageLevel.MEMORY_ONLY)
However, when I looked into the job tracker, I still see a lot of Shuffle Write and Shuffle spill to disk ...
    Total task time across all tasks: 49.1 h
    Input Size / Records: 21.6 GB / 102123058
    Shuffle write: 532.9 GB / 182440290
    Shuffle spill (memory): 370.7 GB
    Shuffle spill (disk): 15.4 GB
Then the job failed with "no space left on device".
... I am wondering, for the 532.9 GB of Shuffle write here, is it written to disk or to memory?
Also, why are there still 15.4 GB of data spilled to disk when I specifically asked to keep them in memory?
Thanks!
Spark will gather the required data from each partition and combine it into a new partition, likely on a different executor. During a shuffle, data is written to disk and transferred across the network, halting Spark's ability to do processing in-memory and causing a performance bottleneck.
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
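Which of the two happens for a cached dataset depends on the storage level you choose. A minimal sketch (the sc SparkContext, input path, and RDD names are placeholders, not taken from the question):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input")

    // MEMORY_ONLY: partitions that do not fit in memory are simply not cached
    // and are recomputed from the lineage the next time they are needed.
    val recomputed = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: partitions that do not fit in memory are written to
    // local disk instead of being recomputed.
    val spilled = lines.map(_.toLowerCase).persist(StorageLevel.MEMORY_AND_DISK)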
Shuffle write happens in one stage, while shuffle read happens in the subsequent stage. Further, the shuffle write operation is executed independently for each input partition that needs to be shuffled, and similarly, the shuffle read operation is executed independently for each shuffled partition.
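You can see that stage boundary by printing the lineage of a shuffled RDD. The sketch below is a generic word count, not the asker's job, and assumes an existing SparkContext sc; the ShuffledRDD entry in the output marks where the map side's shuffle write ends and the reduce side's shuffle read begins:

    // reduceByKey forces a shuffle and therefore a new stage.
    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The lineage shows the ShuffledRDD that separates the two stages.
    println(counts.toDebugString)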
Shuffle spill happens when there is not enough memory for the shuffle data. Because deserialized data occupies more space than serialized data, Shuffle spill (memory) is larger than Shuffle spill (disk). Note that this spill size can become incredibly large with big input data.
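If the deserialized objects are what blow up your memory, one mitigation (a sketch, assuming myClass is Kryo-serializable; the app name, path, and parse helper are hypothetical) is to cache in serialized form and register your classes with Kryo, since MEMORY_ONLY_SER keeps each partition as one compact byte array:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("serialized-cache-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[myClass]))
    val sc = new SparkContext(conf)

    // Serialized caching trades extra CPU for a much smaller memory footprint.
    val cached = sc.textFile("hdfs:///data/input")
      .map(line => (line, parse(line)))   // parse is a hypothetical helper returning myClass
      .persist(StorageLevel.MEMORY_ONLY_SER)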
The persist calls in your code are entirely wasted if you don't access the RDD multiple times. What's the point of storing something if you never access it? Caching has no bearing on shuffle behavior, other than that you can avoid re-doing shuffles by keeping their output cached.
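For contrast, here is a sketch of a case where persist does pay off (parseEvent and the path are hypothetical): the cached RDD is referenced by two actions, so the second action reads the cached partitions instead of recomputing the whole lineage:

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs:///data/events")
      .map(parseEvent)                       // hypothetical parsing function
      .persist(StorageLevel.MEMORY_ONLY)

    val total  = parsed.count()              // first action: computes and caches
    val sample = parsed.take(10)             // second action: served from the cache

    parsed.unpersist()                       // release the cached blocks when done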
Shuffle spill is controlled by the spark.shuffle.spill and spark.shuffle.memoryFraction configuration parameters. If spilling is enabled (it is by default), then shuffle data will spill to disk once it starts using more memory than allowed by spark.shuffle.memoryFraction (20% by default).
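For example, on the older Spark releases where these keys are honored (from Spark 1.6 on, spark.shuffle.spill is ignored and spark.shuffle.memoryFraction only applies in legacy memory mode), they can be set on the SparkConf or passed with --conf at submit time; the values below are only an illustration:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .set("spark.shuffle.spill", "true")          // default; "false" disables spilling
      .set("spark.shuffle.memoryFraction", "0.4")  // default is 0.2 (20% of the heap)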
The metrics are very confusing. My reading of the code is that "Shuffle spill (memory)" is the amount of memory that was freed up as things were spilled to disk. The code for "Shuffle spill (disk)" looks like it's the amount actually written to disk. Going by the code, I think "Shuffle write" is the amount written to disk directly, not as a spill from a sorter.