
Understanding Spark shuffle spill

Tags:

apache-spark

If I understand correctly, when a reduce task goes about gathering its input shuffle blocks (from the outputs of different map tasks), it first keeps them in memory (Q1). When the shuffle-reserved memory of an executor (before the change in memory management (Q2)) is exhausted, the in-memory data is "spilled" to disk. If spark.shuffle.spill.compress is true, that in-memory data is written to disk in compressed form.
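
For reference, a minimal sketch of the compression settings involved; spark.shuffle.spill.compress is the one mentioned above, and the related map-output flag is shown alongside it. As far as I know both default to true, so the values here are illustrative, not a recommendation:

    import org.apache.spark.SparkConf

    // Sketch only: spill compression plus the related map-output compression flag.
    val conf = new SparkConf()
      .set("spark.shuffle.spill.compress", "true") // compress data spilled to disk during shuffles
      .set("spark.shuffle.compress", "true")       // compress the map-side shuffle output files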

My questions:

Q0: Is my understanding correct?

Q1: Is the gathered data inside the reduce task always uncompressed?

Q2: How can I estimate the amount of executor memory available for gathering shuffle blocks?

Q3: I've seen the claim that "shuffle spill happens when your dataset cannot fit in memory", but to my understanding, as long as the shuffle-reserved executor memory is big enough to contain all the (uncompressed) shuffle input blocks of all its ACTIVE tasks, no spill should occur. Is that correct?

If so, to avoid spills does one need to make sure that the (uncompressed) data that ends up in all parallel reduce-side tasks is smaller than the executor's shuffle-reserved memory?
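
To make that premise concrete, a rough illustration under the pre-1.6 static model; the executor size and task count below are made up, and the fractions are just the commonly cited defaults:

    // Illustrative numbers only (pre-1.6 static memory model).
    val executorMemoryMb = 8192.0 // spark.executor.memory = 8g, assumed
    val shuffleFraction  = 0.2    // spark.shuffle.memoryFraction (default)
    val safetyFraction   = 0.8    // spark.shuffle.safetyFraction (default)
    val activeTasks      = 4      // concurrent reduce tasks on this executor, assumed

    val shuffleReservedMb = executorMemoryMb * shuffleFraction * safetyFraction // ~1311 MB
    val perTaskBudgetMb   = shuffleReservedMb / activeTasks                     // ~328 MB of uncompressed
                                                                                // input per task before spilling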

Harel Gliksman asked Jun 16 '16

People also ask

How does Spark shuffling work?

Spark shuffles the mapped data across partitions; sometimes it also stores the shuffled data on disk for reuse when it needs to recalculate. Finally, it runs reduce tasks on each partition based on key.

What triggers a shuffle in Spark?

Transformations which can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
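
A tiny illustration of one such shuffle-triggering operation, assuming a spark-shell session where sc is already available and using made-up data:

    // reduceByKey moves all records with the same key onto the same reduce-side
    // partition, which forces a shuffle across the cluster.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)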


1 Answer

Memory management differs before and after Spark 1.6. In both cases there are notions of execution memory and storage memory. The difference is that before 1.6 the split is static: a configuration parameter specifies how much memory is for execution and how much is for storage, and a spill occurs when either one is not enough.
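
A sketch of the two models for reference; the property names are standard, but the numeric values are only commonly cited defaults and may differ between Spark versions:

    import org.apache.spark.SparkConf

    // Before 1.6 (static model): fixed, separately configured slices.
    val legacyConf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.2") // slice reserved for execution (shuffle)
      .set("spark.storage.memoryFraction", "0.6") // slice reserved for storage (caching)

    // 1.6+ (unified model): one shared region; execution and storage can borrow from each other.
    val unifiedConf = new SparkConf()
      .set("spark.memory.fraction", "0.6")        // execution + storage combined
      .set("spark.memory.storageFraction", "0.5") // part of that region protected for storage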

One of the issues that Apache Spark has to work around is the concurrent execution of:

  • different stages that are executed in parallel
  • different tasks like aggregation or sorting.

  1. I'd say that your understanding is correct.

  2. What's in memory is uncompressed, otherwise it could not be processed. Execution memory is spilled to disk in blocks and, as you mentioned, the spilled blocks can be compressed.

  3. Well, since 1.3.1 you can configure it, so you know the size up front. As for what's left at any moment in time, you can see that by looking at the executor process with something like jstat -gcutil <pid> <period>; it might give you a clue of how much memory is free there. Knowing how much memory is configured for storage and execution, and keeping default.parallelism as small as possible, also gives you a clue (see the rough estimate sketched after this list).

  4. That's true, but it's hard to reason about: there might be skew in the data (some keys having many more values than others), there are many parallel executions, and so on (see the skew check sketched after this list).
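
A back-of-the-envelope estimate for point 3, not an exact formula; the 300 MB reserved chunk and the fractions follow my understanding of the unified (1.6+) memory manager, and the executor size and task count are assumptions:

    val executorMemoryMb = 4096L // spark.executor.memory = 4g, assumed
    val reservedMb       = 300L  // memory Spark keeps for itself (approximate)
    val memoryFraction   = 0.6   // spark.memory.fraction
    val storageFraction  = 0.5   // spark.memory.storageFraction
    val concurrentTasks  = 4     // cores per executor, assumed

    val unifiedMb        = (executorMemoryMb - reservedMb) * memoryFraction // ~2278 MB shared region
    val executionFloorMb = unifiedMb * (1 - storageFraction)                // ~1139 MB guaranteed to execution
                                                                            // (it can borrow more if storage is idle)
    val perTaskMb        = executionFloorMb / concurrentTasks               // ~285 MB per concurrent task
    println(f"~$perTaskMb%.0f MB of guaranteed execution memory per task")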

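And for point 4, one quick and rough way to see whether a few keys dominate, again assuming a spark-shell session with sc available; pairRdd here is a stand-in for your real RDD:

    // Count records per key and look at the heaviest ones; a few very large
    // counts relative to the rest is a sign of skew.
    val pairRdd  = sc.parallelize(Seq("a", "a", "a", "a", "b", "c")).map(k => (k, 1L))
    val heaviest = pairRdd
      .reduceByKey(_ + _)
      .top(10)(Ordering.by(_._2)) // ten most frequent keys
    heaviest.foreach { case (k, n) => println(s"$k -> $n records") }
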
evgenii answered Sep 20 '22