Context
I run the following command in a Linux bash shell:
mono --debug --debugger-agent=transport=dt_socket,address=198.178.155.198:10000 ./Stress.exe
Stress.exe is a C# application.
What happens
At some point the system runs out of memory, which is intended. An error code is returned.
Error code returned (echo $?)
Code 1: when my program throws an exception because it is out of memory.
Code 137: when it is killed by the OS for overloading memory.
Question
Why is it sometimes the OS that kills my application? Why is the result not always the same?
If a few pods are consistently getting exit code 137 returned to them, then that is a sign that you need to increase the amount of space you afford to the pod. By increasing the maximum limit manually in the pods that are under the most strain, you'll be able to reduce the frequency with which this problem occurs.
When a container (Spark executor) runs out of memory, YARN automatically kills it. This causes the "Container killed on request. Exit code is 137" error.
Exit Code 1 indicates that a container shut down either because of an application failure or because the image pointed to an invalid file. Unlike 137, this code is set by the application itself, not derived from a signal: when a process is terminated by signal N, the shell reports 128 + N instead.
An exit code, also known as a return code, is the code returned to a parent process by an executable. On POSIX systems the standard exit code is 0 for success, and any number from 1 to 255 indicates something else. Exit codes can be interpreted by scripts to adapt in the event of success or failure.
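As a quick illustration of the 128 + N convention behind code 137 (the commands here are arbitrary stand-ins, not the original Stress.exe):

```shell
# A plain application failure: the process chooses its own exit status.
sh -c 'exit 1' || echo "plain failure: $?"      # prints 1

# A signal death: the shell reports 128 + signal number,
# so SIGKILL (signal 9) yields 128 + 9 = 137.
sleep 30 &
kill -9 $!
wait $! 2>/dev/null || echo "killed by SIGKILL: $?"  # prints 137
```

This is exactly the difference the question observes: 1 comes from inside the program, 137 comes from the kernel killing it.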
Assuming the default SGEN garbage collector:
Let's talk SGEN. As you allocate objects, they are created in the nursery. When the nursery fills up and the GC does a sweep, it performs a nursery collection and the live objects are moved to its major heap. If the major heap is full, more OS memory is requested. You can adjust the amount of initial memory allocated to your Mono app and even cap the maximum amount of memory that SGEN can use. Also, managed objects over 8000 bytes are handled by SGEN's Large Object Space manager; that memory is neither nursery- nor major-heap-based, but it still holds managed objects.
So normally, when Mono needs more space for managed objects, requests an additional block from the OS, and the OS says no, you see the OutOfMemoryException and your exit code of 1. Your stress test is happy.
But the OOM killer is watching that Mono process and adjusting its score (oom_score) higher and higher. It could strike the Mono process at any moment, but I would put the odds on it striking right at the time of a GC sweep, when the app threads are suspended by SGEN but before SGEN actually makes an OS memory request for the exhausted nursery. Thus you get an exit code of 137. 137 − 128 = 9, so the Mono process was sent a SIGKILL signal (kill -9), and your stress test is not happy.
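You can watch that score yourself: on Linux, the kernel exposes each process's OOM "badness" score under /proc. A minimal sketch, inspecting the current shell's own PID rather than the Mono process:

```shell
# Each process's OOM badness score lives in /proc/<pid>/oom_score;
# higher scores make the process a more likely OOM-killer victim.
cat /proc/$$/oom_score

# oom_score_adj lets you bias the score, from -1000 (never kill)
# to +1000 (kill first). Writing it requires appropriate privileges.
cat /proc/$$/oom_score_adj
```

Substituting the Mono process's PID (e.g. from `pgrep mono`) lets you watch the score climb as Stress.exe allocates.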
Try this as an experiment:
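One such experiment, building on the SGEN tunables mentioned above (the 256m figure is an arbitrary choice for illustration): cap SGEN's heap so Mono runs out of managed memory and throws OutOfMemoryException deterministically, well before the kernel's OOM killer takes an interest.

```shell
# Cap SGEN's major heap via MONO_GC_PARAMS; allocations beyond the cap
# raise OutOfMemoryException inside the app instead of growing the
# process until the kernel OOM killer steps in.
MONO_GC_PARAMS=max-heap-size=256m \
  mono --debug ./Stress.exe
echo "exit code: $?"
```

With the cap in place you should see the in-process failure path (exit code 1) consistently, rather than a mix of 1 and 137.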
This is not a Mono and/or SGEN/GC-related 'issue' at all. Any process consuming more and more memory is subject to being OOM-killed. Be it a big fat Oracle database or just an app/daemon with a memory leak, they are all subject to being killed.
"This article describes the Linux out-of-memory (OOM) killer and how to find out why it killed a particular process. It also provides methods for configuring the OOM killer to better suit the needs of many different environments."
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
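To confirm after the fact that it was the OOM killer, and not the application, that ended the process, check the kernel log. A sketch; the exact message wording varies between kernel versions, and reading the log may require elevated privileges:

```shell
# The OOM killer logs its victims to the kernel ring buffer:
dmesg | grep -iE 'out of memory|killed process'

# On systemd machines, the kernel journal works too:
journalctl -k --no-pager | grep -i 'killed process'
```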