I am trying to join two large Spark DataFrames and keep running into this error:
Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This seems like a common issue among Spark users, but I can't find any solid description of what spark.yarn.executor.memoryOverhead actually is. In some places it sounds like a kind of memory buffer before YARN kills the container (e.g. 10 GB was requested, but YARN won't kill the container until it uses 10.2 GB). In other places it sounds like it is used for some kind of data accounting tasks that are completely separate from the analysis I want to perform. My questions are: what exactly does this overhead cover, and how should I size it?
The memoryOverhead property is added to the executor memory to determine the full memory request to YARN for each executor. It defaults to max(executorMemory * 0.10, 384 MB).
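To see how that formula produces the numbers in your error, here is a sketch of the arithmetic only (the 20 GB heap is my assumption; it just happens to reproduce the 22 GB limit YARN reported):

```python
# Container size YARN allocates per executor under the default overhead rule.
def yarn_container_request_mb(executor_memory_mb):
    overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

# An assumed 20 GB --executor-memory yields a ~22 GB container, matching the
# "24 GB of 22 GB physical memory used" limit in the error message above.
print(yarn_container_request_mb(20 * 1024))  # 22528 MB ~= 22 GB
```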
The spark.driver.memoryOverhead property sets the additional non-heap memory allocated for the driver process in cluster mode. This is memory that accounts for things like VM overheads, interned strings, and other native overheads; it tends to grow with container size (typically 6-10%).
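Both overheads can be set explicitly. A minimal sketch, assuming the Spark 2.3+ property names (older versions use spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead instead); note that in cluster mode the driver settings must be passed to spark-submit, since the driver JVM is already running by the time this code executes, so they appear here only to show the property names:

```python
from pyspark.sql import SparkSession

# Illustrative values only, not recommendations.
spark = (
    SparkSession.builder
    .appName("overhead-demo")
    .config("spark.executor.memory", "18g")         # executor heap
    .config("spark.executor.memoryOverhead", "3g")  # off-heap allowance per executor
    .config("spark.driver.memory", "8g")            # driver heap
    .config("spark.driver.memoryOverhead", "1g")    # off-heap allowance for the driver
    .getOrCreate()
)
```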
As a broad guideline, set executor memory between 8 GB and 16 GB; the exact value is a judgment call governed by the two points above. Pack as many executors as can be assigned to one cluster node, and distribute cores evenly across all executors.
For example, on a node with 30 usable cores and 64 GB of usable memory, using 10 cores per executor: number of executors per node = 30 / 10 = 3. Memory per executor = 64 GB / 3 ≈ 21 GB. Off-heap overhead at 7% of 21 GB ≈ 1.5 GB; rounding that up to 3 GB for headroom gives an actual --executor-memory of 21 - 3 = 18 GB. The sketch below walks through the same arithmetic.
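```python
import math

# A sketch of the sizing arithmetic above. The node specs (30 cores, 64 GB)
# and 10 cores per executor are the assumptions from the example, not rules.
def size_executors(node_cores=30, node_mem_gb=64, cores_per_executor=10,
                   overhead_fraction=0.07):
    executors_per_node = node_cores // cores_per_executor   # 30 // 10 = 3
    mem_per_executor = node_mem_gb // executors_per_node    # 64 // 3 = 21 GB
    # Reserve the off-heap overhead out of each executor's share,
    # respecting the 384 MB floor.
    overhead_gb = max(mem_per_executor * overhead_fraction, 0.384)  # ~1.5 GB
    heap_gb = mem_per_executor - math.ceil(overhead_gb)     # 21 - 2 = 19 GB
    return executors_per_node, heap_gb

print(size_executors())  # (3, 19); rounding overhead up to 3 GB gives 18 GB
```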
Overhead options are nicely explained in the configuration document:
This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
This also includes user objects if you use one of the non-JVM guest languages (Python, R, etc.).
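That last point is often what bites PySpark jobs: Python worker processes (e.g. for UDFs) run outside the JVM heap, so their memory is charged against the container limit rather than spark.executor.memory. A hedged sketch, with illustrative values; spark.executor.pyspark.memory requires Spark 2.4+:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = (
    SparkSession.builder
    .appName("pyspark-overhead-demo")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "4g")  # extra room for Python workers
    .config("spark.executor.pyspark.memory", "2g")  # caps Python worker memory (Spark 2.4+)
    .getOrCreate()
)

# Any Python UDF forks Python worker processes on the executors, whose
# resident memory counts toward the container limit that YARN enforces.
double = udf(lambda x: x * 2, IntegerType())
df = spark.range(10).withColumn("doubled", double("id"))
df.show()
```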