My input dataset is about 150G. I am setting
--conf spark.cores.max=100
--conf spark.executor.instances=20
--conf spark.executor.memory=8G
--conf spark.executor.cores=5
--conf spark.driver.memory=4G
but since data is not evenly distributed across executors, I kept getting
Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used
here are my questions:
1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to make perfect distribution, so some executors will suffer
2. I think about repartition the input dataFrame, so how can I determine how many partition to set? the higher the better, or?
3. The error says "9 GB physical memory used", but i only set 8G to executor memory, where does the extra 1G come from?
Thank you!
When using yarn, there is another setting that figures into how big to make the yarn container request for your executors:
spark.yarn.executor.memoryOverhead
It defaults to 0.1 * your executor memory setting. It defines how much extra overhead memory to ask for in addition to what you specify as your executor memory. Try increasing this number first.
Also, a yarn container won't give you memory of an arbitrary size. It will only return containers allocated with a memory size that is a multiple of it's minimum allocation size, which is controlled by this setting:
yarn.scheduler.minimum-allocation-mb
Setting that to a smaller number will reduce the risk of you "overshooting" the amount you asked for.
I also typically set the below key to a value larger than my desired container size to ensure that the spark request is controlling how big my executors are, instead of yarn stomping on them. This is the maximum container size yarn will give out.
nodemanager.resource.memory-mb
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With