I am running Spark on YARN. I don't understand the difference between the following settings: spark.yarn.executor.memoryOverhead and spark.memory.offHeap.size. Both seem to be settings for allocating off-heap memory to a Spark executor. Which one should I use? Also, what is the recommended setting for executor off-heap memory?
Many thanks!
TL;DR: For Spark 1.x and 2.x, Total Off-Heap Memory = spark.executor.memoryOverhead (spark.memory.offHeap.size is included within it).
For Spark 3.x, Total Off-Heap Memory = spark.executor.memoryOverhead + spark.memory.offHeap.size.
(credit to this page)
Detailed explanation:
spark.executor.memoryOverhead is used by resource management systems like YARN, whereas spark.memory.offHeap.size is used by Spark core (the memory manager). The relationship is a bit different depending on the Spark version.
Spark 2.4.5 and before:
spark.executor.memoryOverhead should include spark.memory.offHeap.size. This means that if you specify offHeap.size, you need to manually add this portion to memoryOverhead for YARN. As you can see from the code below from YarnAllocator.scala, when YARN requests resources it knows nothing about offHeap.size:
private[yarn] val resource = Resource.newInstance(
  executorMemory + memoryOverhead + pysparkWorkerMemory,
  executorCores)
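To make this concrete, here is a minimal sketch of a Spark 2.4.x setup with off-heap enabled; the sizes (4g heap, 2g off-heap, 3g overhead) and the app name are illustrative assumptions, not recommendations:

// Hypothetical Spark 2.4.x configuration: the off-heap allocation must be
// folded into memoryOverhead by hand, because YARN only sees
// executorMemory + memoryOverhead (+ PySpark memory).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-overhead-spark2")
  .master("yarn")
  .config("spark.executor.memory", "4g")          // on-heap executor memory
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")      // invisible to YarnAllocator in 2.x
  .config("spark.executor.memoryOverhead", "3g")  // e.g. ~1g base overhead + 2g off-heap, added manually
  .getOrCreate()

With these settings, YARN requests about 4g + 3g = 7g per executor container (plus PySpark memory if configured).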
However, this behavior changed in Spark 3.0:
spark.executor.memoryOverhead does not include spark.memory.offHeap.size anymore. YARN will include offHeap.size for you when requesting resources. From the new documentation:
Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.
You can also see this in the code:
private[yarn] val resource: Resource = {
  val resource = Resource.newInstance(
    executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory, executorCores)
  ResourceRequestHelper.setResourceRequests(executorResourceRequests, resource)
  logDebug(s"Created resource capability: $resource")
  resource
}
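As a contrasting sketch for Spark 3.x (same illustrative sizes as above), memoryOverhead no longer needs to cover the off-heap portion, because YARN now adds spark.memory.offHeap.size to the container request itself:

// Hypothetical Spark 3.x configuration: off-heap is requested from YARN automatically.
// Sizes are illustrative assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-overhead-spark3")
  .master("yarn")
  .config("spark.executor.memory", "4g")          // on-heap executor memory
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")      // added to the container request by YARN
  .config("spark.executor.memoryOverhead", "1g")  // overhead only; off-heap no longer included
  .getOrCreate()

Here the container request is roughly 4g (executor) + 2g (off-heap) + 1g (overhead) = 7g, plus spark.executor.pyspark.memory if it is configured.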
For more details on this change, you can refer to this pull request.
For your second question, what is the recommended setting for executor off-heap memory: it depends on your application, and you need to do some testing. I found this page helpful in explaining it further:
Off-heap memory is a great way to reduce GC pauses because it is outside the GC's scope. However, it brings an overhead of serialization and deserialization, which in turn means that off-heap data can sometimes be put back onto the heap and thus exposed to GC. Also, the new data format brought by Project Tungsten (an array of bytes) helps to reduce the GC overhead. For these two reasons, the use of off-heap memory in Apache Spark applications should be carefully planned and, especially, tested.
BTW, spark.yarn.executor.memoryOverhead is deprecated and has been renamed to spark.executor.memoryOverhead, which is common to both YARN and Kubernetes.