I am running a Spark streaming application with 2 workers. The application has a join and a union operation.

All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (spill memory is more than 20 times larger).

Please find the Spark stage details in the image below:

[image: https://i.stack.imgur.com/uOhE8.jpg]

After researching this, I found that shuffle spill happens when there is not sufficient memory for shuffle data.

Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling
Shuffle spill (disk) - size of the serialized form of the data on disk after spilling

Since deserialized data occupies more space than serialized data, shuffle spill (memory) is the larger of the two. I noticed that the spill memory size is incredibly large with big input data.

My questions are:

Does this spilling impact performance considerably?
How can I optimize this spilling, both memory and disk?
Are there any Spark properties that can reduce or control this huge spilling?


How to optimize shuffle spill in Apache Spark application

2 Answers

Performance-tuning Spark takes quite a bit of investigation and learning. There are a few good resources, including this video. Spark 1.4 has better diagnostics and visualisation in the interface, which can help you.

In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the amount of memory available for the shuffle buffer.
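To make that condition concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It assumes the legacy memory model of Spark 1.x before 1.6, where the shuffle pool is approximately executor heap * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction (defaults 0.2 and 0.8) and is shared by the tasks running concurrently in the executor. The function names are hypothetical, and the even split across tasks is a simplification, so treat the result as an estimate only.

```python
# Rough estimate only. Assumptions: legacy (pre-1.6) memory model, usable heap
# equal to spark.executor.memory, and the shuffle pool split evenly across tasks.

def shuffle_memory_per_task_mb(executor_memory_mb, cores_per_executor,
                               memory_fraction=0.2, safety_fraction=0.8):
    """Approximate shuffle buffer available to one task, in MB."""
    pool_mb = executor_memory_mb * memory_fraction * safety_fraction
    return pool_mb / cores_per_executor


def likely_to_spill(partition_size_mb, executor_memory_mb, cores_per_executor,
                    memory_fraction=0.2, safety_fraction=0.8):
    """True if a partition's in-memory (deserialized) size exceeds the per-task budget."""
    budget_mb = shuffle_memory_per_task_mb(
        executor_memory_mb, cores_per_executor, memory_fraction, safety_fraction)
    return partition_size_mb > budget_mb


if __name__ == "__main__":
    budget = shuffle_memory_per_task_mb(executor_memory_mb=4096, cores_per_executor=4)
    print("per-task shuffle budget (MB): %.2f" % budget)
    print("256 MB partition likely to spill:", likely_to_spill(256, 4096, 4))
```

With a 4 GB executor and 4 concurrent tasks, the default fractions leave roughly 164 MB of shuffle buffer per task, so a partition that deserializes to 256 MB would be expected to spill.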

You can:

1. Manually repartition() your prior stage so that the input arrives in smaller partitions.
2. Increase the shuffle buffer by increasing the memory of your executor processes (spark.executor.memory).
3. Increase the shuffle buffer by raising the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to take that memory away from spark.storage.memoryFraction.
4. Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory.
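As an illustration of the memory-related options above, the following plain-Python sketch builds the matching spark-submit arguments. The helper names and the example values (4g, 0.4, 0.4) are hypothetical, not recommendations. The sketch assumes Spark 1.x legacy memory management. From Spark 1.6 on, these two fraction properties are only honoured when spark.memory.useLegacyMode is enabled, and legacy mode was removed in Spark 3.0.

```python
# Sketch: build spark-submit --conf flags for the legacy shuffle/storage fractions.
# Helper names and default values are illustrative assumptions.

def legacy_shuffle_conf(executor_memory="4g", shuffle_fraction=0.4, storage_fraction=0.4):
    """Return Spark properties that enlarge the shuffle buffer at the expense of storage."""
    if shuffle_fraction + storage_fraction >= 1.0:
        # Illustrative guard: leave some heap for user code and Spark internals.
        raise ValueError("shuffle and storage fractions must sum to less than 1.0")
    return {
        "spark.executor.memory": executor_memory,
        "spark.shuffle.memoryFraction": str(shuffle_fraction),
        "spark.storage.memoryFraction": str(storage_fraction),
    }


def to_submit_args(conf):
    """Flatten a property dict into ['--conf', 'key=value', ...] in sorted key order."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "%s=%s" % (key, value)]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(legacy_shuffle_conf())))
```

The Spark 1.x defaults are 0.2 for shuffle and 0.6 for storage, so raising the first to 0.4 and lowering the second to 0.4 keeps their sum unchanged. SPARK_WORKER_CORES is an environment variable for standalone workers (set in spark-env.sh), so it is not included in the --conf flags.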

If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.

answered Sep 23 '22 16:09

Alister Lee

To add to the above answer, you may also consider increasing the number of shuffle partitions (spark.sql.shuffle.partitions, default 200, which applies to DataFrame and SQL shuffles) to a value that results in partitions close to the HDFS block size (i.e. 128 MB to 256 MB).
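The arithmetic behind that suggestion can be sketched in a few lines of plain Python. The function name is hypothetical, and the shuffle size is something you would read off the Spark UI for the stage in question.

```python
import math


def shuffle_partitions_for(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Number of partitions needed so that each holds about target_partition_bytes."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


if __name__ == "__main__":
    shuffle_bytes = 100 * 1024**3  # assumed example: 100 GiB of shuffle data
    print("spark.sql.shuffle.partitions =", shuffle_partitions_for(shuffle_bytes))
```

For example, 100 GiB of shuffle data at a 128 MiB target works out to 800 partitions, four times the default of 200.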

If your data is skewed, try tricks like salting the keys to increase parallelism.
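The idea of salting can be shown without Spark. In this minimal sketch, zlib.crc32 stands in for a hash partitioner (an assumption made so the result is deterministic; Spark uses its own partitioner), and the helper names are hypothetical. A single hot key lands entirely in one partition, while the salted variants of that key are spread over several.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salt_key(key, record_index, num_salts):
    """Append a salt in [0, num_salts) so one hot key becomes num_salts distinct keys."""
    return "%s_%d" % (key, record_index % num_salts)


def partition_loads(keys, num_partitions):
    """Count how many records each partition would receive."""
    return Counter(partition_of(k, num_partitions) for k in keys)


if __name__ == "__main__":
    hot = ["hot"] * 1000
    salted = [salt_key(k, i, 8) for i, k in enumerate(hot)]
    print("largest partition, plain keys: ", max(partition_loads(hot, 8).values()))
    print("largest partition, salted keys:", max(partition_loads(salted, 8).values()))
```

In a real job the salt has to be undone afterwards: for an aggregation, aggregate on the salted key first and then again on the original key. For a join, the other side must be replicated once per salt value so that every salted key still finds its match.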

Read these to understand Spark memory management:

https://0x0fff.com/spark-memory-management/

https://www.tutorialdocs.com/article/spark-memory-management.html

answered Sep 22 '22 16:09

Prasad Sogalad



Tags: apache-spark, spark-streaming, apache-spark-1.4

Vijay Innamuri
