
Difference between "spark.yarn.executor.memoryOverhead" and "spark.memory.offHeap.size"

I am running Spark on YARN. I don't understand the difference between the following settings: spark.yarn.executor.memoryOverhead and spark.memory.offHeap.size. Both seem to be settings for allocating off-heap memory to the Spark executor. Which one should I use? Also, what is the recommended setting for executor off-heap memory?

Many thanks!

Asked Dec 23 '22 by Abdul Rahman

1 Answer

TL;DR: For Spark 1.x and 2.x, Total Off-Heap Memory = spark.executor.memoryOverhead (spark.memory.offHeap.size is included within it). For Spark 3.x, Total Off-Heap Memory = spark.executor.memoryOverhead + spark.memory.offHeap.size (credit to this page).

Detailed explanation:

spark.executor.memoryOverhead is used by resource managers like YARN, whereas spark.memory.offHeap.size is used by Spark core (the memory manager). The relationship is a bit different depending on the version.

Spark 2.4.5 and before:

spark.executor.memoryOverhead should include spark.memory.offHeap.size. This means that if you specify offHeap.size, you need to manually add that portion to memoryOverhead for YARN. As you can see from the code below, taken from YarnAllocator.scala, when Spark requests resources from YARN it does not account for offHeap.size at all:

private[yarn] val resource = Resource.newInstance(
    executorMemory + memoryOverhead + pysparkWorkerMemory,
    executorCores)
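
To make this concrete, here is a minimal Spark 2.x sketch of that manual bookkeeping (the SparkConf-based setup and all of the sizes are illustrative assumptions, not values from the question):

import org.apache.spark.SparkConf

// Hypothetical Spark 2.x sizing: 8g of heap plus 2g of off-heap storage.
// Because memoryOverhead must cover offHeap.size on Spark 2.x, the 2g of
// off-heap is added by hand on top of the usual ~10% overhead cushion.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")
  .set("spark.executor.memoryOverhead", "3g") // ≈ 1g default cushion + 2g off-heap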

However, this behavior changed in Spark 3.0:

spark.executor.memoryOverhead no longer includes spark.memory.offHeap.size. Spark will add offHeap.size to the YARN request for you. From the new documentation:

Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.

And from the code you can also tell:

private[yarn] val resource: Resource = {
    val resource = Resource.newInstance(
      executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory, executorCores)
    ResourceRequestHelper.setResourceRequests(executorResourceRequests, resource)
    logDebug(s"Created resource capability: $resource")
    resource
  }
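
To make that sum concrete, here is a rough worked example of the container size Spark 3.x will request (all values are hypothetical and in MiB):

// Mirrors the Resource.newInstance sum above; the concrete sizes are assumptions.
val executorMemory  = 8 * 1024  // spark.executor.memory        = 8g
val offHeapMemory   = 2 * 1024  // spark.memory.offHeap.size    = 2g
val memoryOverhead  = 819       // default: max(384, 0.10 * 8192)
val pysparkMemory   = 0         // spark.executor.pyspark.memory not set

val containerMemory = executorMemory + offHeapMemory + memoryOverhead + pysparkMemory
// containerMemory == 11059 MiB, i.e. roughly an 11 GiB container per executor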

For more details on this change, you can refer to this Pull Request.

For your second question, what is the recommended setting for executor off-heap memory? It depends on your application, and you will need some testing. I found this page helpful in explaining it further:

Off-heap memory is a great way to reduce GC pauses because it is outside the GC's scope. However, it brings an overhead of serialization and deserialization, which in turn means that off-heap data can sometimes be put back onto the heap and hence exposed to GC. Also, the new data format brought by Project Tungsten (an array of bytes) helps reduce the GC overhead. For these two reasons, the use of off-heap memory in Apache Spark applications should be carefully planned and, especially, tested.

BTW, spark.yarn.executor.memoryOverhead is deprecated and has been renamed to spark.executor.memoryOverhead, which is common to both YARN and Kubernetes.
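
For completeness, here is a minimal Spark 3.x sketch using the renamed property (the sizes are again hypothetical; these settings need to be in place before the SparkContext starts, whether set here or via spark-submit):

import org.apache.spark.sql.SparkSession

// Hypothetical Spark 3.x sizing. memoryOverhead no longer has to account for
// offHeap.size; Spark adds offHeap.size to the container request itself.
val spark = SparkSession.builder()
  .appName("offheap-sizing-example")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.memoryOverhead", "1g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()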

Answered Dec 28 '22 by Joshua Chen