I am trying to join two large Spark DataFrames and keep running into this error:
Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This seems like a common issue among Spark users, but I can't find any solid description of what spark.yarn.executor.memoryOverhead actually is. In some places it sounds like a kind of memory buffer before YARN kills the container (e.g. 10 GB was requested, but YARN won't kill the container until it uses 10.2 GB). In other places it sounds like it is used for some kind of data accounting tasks that are completely separate from the analysis I want to perform. My questions are: what exactly does this overhead cover, and how should I size it?
The memoryOverhead property is added to the executor memory to determine the full memory request to YARN for each executor. It defaults to max(executorMemory * 0.10, 384 MB).
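To see how that formula produces the numbers in your error, here is a sketch of the arithmetic only (the 20 GB heap is my assumption; it just happens to reproduce the 22 GB limit YARN reported):

```python
# Container size YARN allocates per executor under the default overhead rule.
def yarn_container_request_mb(executor_memory_mb):
    overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

# An assumed 20 GB --executor-memory yields a ~22 GB container, matching the
# "24 GB of 22 GB physical memory used" limit in the error message above.
print(yarn_container_request_mb(20 * 1024))  # 22528 MB ~= 22 GB
```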
The spark.driver.memoryOverhead property sets the additional non-heap memory allocated for the driver process in cluster mode. This is memory that accounts for things like VM overheads, interned strings, and other native overheads; it tends to grow with container size (typically 6-10%).
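Both overheads can be set explicitly. A minimal sketch, assuming the Spark 2.3+ property names (older versions use spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead instead); note that in cluster mode the driver settings must be passed to spark-submit, since the driver JVM is already running by the time this code executes, so they appear here only to show the property names:

```python
from pyspark.sql import SparkSession

# Illustrative values only, not recommendations.
spark = (
    SparkSession.builder
    .appName("overhead-demo")
    .config("spark.executor.memory", "18g")         # executor heap
    .config("spark.executor.memoryOverhead", "3g")  # off-heap allowance per executor
    .config("spark.driver.memory", "8g")            # driver heap
    .config("spark.driver.memoryOverhead", "1g")    # off-heap allowance for the driver
    .getOrCreate()
)
```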
As a broad guideline, set executor memory between 8 GB and 16 GB; the exact value is a judgment call governed by the two points above. Pack as many executors as can be assigned to one cluster node, and distribute cores evenly across all executors.
For example, on a node with 30 usable cores and 64 GB of usable memory, using 10 cores per executor: number of executors per node = 30 / 10 = 3. Memory per executor = 64 GB / 3 ≈ 21 GB. Off-heap overhead at 7% of 21 GB ≈ 1.5 GB; rounding that up to 3 GB for headroom gives an actual --executor-memory of 21 - 3 = 18 GB. The sketch below walks through the same arithmetic.
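```python
import math

# A sketch of the sizing arithmetic above. The node specs (30 cores, 64 GB)
# and 10 cores per executor are the assumptions from the example, not rules.
def size_executors(node_cores=30, node_mem_gb=64, cores_per_executor=10,
                   overhead_fraction=0.07):
    executors_per_node = node_cores // cores_per_executor   # 30 // 10 = 3
    mem_per_executor = node_mem_gb // executors_per_node    # 64 // 3 = 21 GB
    # Reserve the off-heap overhead out of each executor's share,
    # respecting the 384 MB floor.
    overhead_gb = max(mem_per_executor * overhead_fraction, 0.384)  # ~1.5 GB
    heap_gb = mem_per_executor - math.ceil(overhead_gb)     # 21 - 2 = 19 GB
    return executors_per_node, heap_gb

print(size_executors())  # (3, 19); rounding overhead up to 3 GB gives 18 GB
```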
Overhead options are nicely explained in the configuration document:
This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
This also includes user objects if you use one of the non-JVM guest languages (Python, R, etc.).
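That last point is often what bites PySpark jobs: Python worker processes (e.g. for UDFs) run outside the JVM heap, so their memory is charged against the container limit rather than spark.executor.memory. A hedged sketch, with illustrative values; spark.executor.pyspark.memory requires Spark 2.4+:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = (
    SparkSession.builder
    .appName("pyspark-overhead-demo")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "4g")  # extra room for Python workers
    .config("spark.executor.pyspark.memory", "2g")  # caps Python worker memory (Spark 2.4+)
    .getOrCreate()
)

# Any Python UDF forks Python worker processes on the executors, whose
# resident memory counts toward the container limit that YARN enforces.
double = udf(lambda x: x * 2, IntegerType())
df = spark.range(10).withColumn("doubled", double("id"))
df.show()
```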