 

How does Spark running on YARN account for Python memory usage?

After reading through the documentation I do not understand how Spark running on YARN accounts for Python memory consumption.

Does it count towards spark.executor.memory, spark.executor.memoryOverhead, or somewhere else?

In particular, I have a PySpark application with spark.executor.memory=25G and spark.executor.cores=4, and I encounter frequent "Container killed by YARN for exceeding memory limits." errors when running a map on an RDD. The map operates on a fairly large number of complex Python objects, so it is expected to take up a non-trivial amount of memory, but not 25GB. How should I configure the different memory variables for use with heavy Python code?
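For reference, the setup looks roughly like this (the app name is just a placeholder; only the memory-related settings matter here):

    from pyspark import SparkConf, SparkContext

    # Placeholder app name; the two settings below are the ones described above.
    conf = (SparkConf()
            .setAppName("heavy-python-objects-job")
            .set("spark.executor.memory", "25g")
            .set("spark.executor.cores", "4"))
    sc = SparkContext(conf=conf)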

asked Oct 05 '16 by domkck

People also ask

How does Spark work with yarn?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
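As a rough illustrative sketch (not from the original post), the deploy mode of a running PySpark application can be read back from its configuration; this assumes a working YARN client environment and uses a placeholder app name:

    from pyspark import SparkConf, SparkContext

    # Assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at a configured YARN cluster.
    conf = SparkConf().setAppName("deploy-mode-check").setMaster("yarn")
    sc = SparkContext(conf=conf)

    # "client": the driver runs in this process.
    # "cluster": the driver runs inside the YARN application master.
    print(sc.getConf().get("spark.submit.deployMode", "client"))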

How does Spark use memory?

Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster.
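As a small illustrative sketch (not from the original post), caching exercises storage memory while a shuffle-based aggregation exercises execution memory:

    from pyspark import SparkConf, SparkContext, StorageLevel

    sc = SparkContext(conf=SparkConf().setAppName("memory-categories-demo"))

    pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 100, x))

    # Storage memory: cached partitions are kept around for reuse.
    pairs.persist(StorageLevel.MEMORY_ONLY)

    # Execution memory: the shuffle behind reduceByKey uses buffers for
    # sorting and aggregation while the job runs.
    totals = pairs.reduceByKey(lambda a, b: a + b).collect()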

What is yarn memory overhead in Spark?

Memory overhead is the amount of off-heap memory allocated to each executor. By default, memory overhead is set to either 10% of executor memory or 384 MB, whichever is higher.

Does Spark use yarn?

Spark on YARN: Spark uses two key components: a distributed file storage system and a scheduler to manage workloads. Typically, Spark is run with HDFS for storage, and with either YARN (Yet Another Resource Negotiator) or Mesos, two of the most common resource managers.


1 Answer

Because of the heavy Python code, I'd try increasing spark.python.worker.memory from its default (512m); this property's value does not count towards spark.executor.memory.

From the docs: "Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks." (link)
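For example, it could be set when building the SparkConf (the 2g value is only illustrative, not a recommendation):

    from pyspark import SparkConf, SparkContext

    # Give each Python worker more room before it spills aggregation data
    # to disk; 2g is only an example value.
    conf = (SparkConf()
            .setAppName("heavy-python-job")
            .set("spark.python.worker.memory", "2g"))
    sc = SparkContext(conf=conf)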

ExecutorMemoryOverhead calculation in Spark:

    MEMORY_OVERHEAD_FRACTION = 0.10
    MEMORY_OVERHEAD_MINIMUM = 384

    val executorMemoryOverhead =
      max(MEMORY_OVERHEAD_FRACTION * ${spark.executor.memory}, MEMORY_OVERHEAD_MINIMUM)
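Plugging in the question's spark.executor.memory=25G gives the following back-of-the-envelope numbers (quick check in Python):

    # Worked example for spark.executor.memory=25g.
    MEMORY_OVERHEAD_FRACTION = 0.10
    MEMORY_OVERHEAD_MINIMUM = 384                  # MB

    executor_memory_mb = 25 * 1024                 # 25g
    overhead_mb = max(MEMORY_OVERHEAD_FRACTION * executor_memory_mb,
                      MEMORY_OVERHEAD_MINIMUM)

    print(overhead_mb)                             # 2560.0 MB of overhead
    print(executor_memory_mb + overhead_mb)        # 28160.0 MB requested per container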

The corresponding property is spark.yarn.executor.memoryOverhead on YARN and spark.mesos.executor.memoryOverhead on Mesos.

YARN kills processes that take more memory than they requested, and the requested amount is the sum of executorMemory and executorMemoryOverhead.
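So another option is to raise the overhead explicitly; a minimal sketch, assuming YARN client mode and an illustrative 4096 MB value:

    from pyspark import SparkConf, SparkContext

    # Raise the YARN overhead (value in MB) so the container request leaves
    # more headroom for the Python workers; 4096 is only an example.
    conf = (SparkConf()
            .setAppName("heavy-python-job")
            .set("spark.executor.memory", "25g")
            .set("spark.yarn.executor.memoryOverhead", "4096"))
    sc = SparkContext(conf=conf)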

In the diagram referenced below, the Python processes inside each worker are governed by spark.python.worker.memory, while spark.yarn.executor.memoryOverhead + spark.executor.memory applies to the JVM itself.

[Diagram: PySpark internals. Image credits.]

Additional resource: Apache mailing list thread

answered Sep 18 '22 by mrsrinivas