I am running a multiprogrammed workload (based on SPEC CPU2006 benchmarks) on a POWER7 system using SUSE SLES 11.
Sometimes, each application in the workload consumes a significant amount of memory and the total memory footprint exceeds the available memory installed in the system (32 GB).
I disabled swap, since otherwise the measurements could be heavily affected for the processes that use it. I know that, as a result, the kernel may kill some of the processes through the OOM killer. That is totally fine. The problem is that I would expect a process killed by the kernel to terminate with an error condition (e.g., the process was terminated by a signal).
I have a framework that launches all the processes and then waits for them using
waitpid(pid, &status, 0);
Even if a process is killed by the OOM killer (I know this because I get a message on the screen and in /var/log/messages), the call
WIFEXITED(status);
returns one, and the call
WEXITSTATUS(status);
returns zero. Therefore, I am not able to distinguish a process that finishes correctly from one that is killed by the OOM killer.
Am I doing anything wrong? Do you know any way to detect when a process has been killed by the OOM killer?
I found this post asking pretty much the same question. However, since it is an old post and answers were not satisfactory, I decided to post a new question.
If you receive a notification that an out-of-memory event has occurred, the OOM killer will already have done its job and memory will already have been freed. To find out what happened, you have to inspect the logs. You can use the command less /var/log/kern.
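Matching a log entry against the PIDs your framework launched can be sketched as follows. This is a minimal sketch, not a definitive parser: the function name oom_killed_pid is hypothetical, and it assumes the kernel log line contains the text "Killed process <pid>", which is the wording I would expect from the kernel but which may differ between kernel versions, so check it against your own /var/log/messages first.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the PID named in a log line of the assumed
 * form "... Killed process <pid> (<name>) ...", or -1 if the line does not
 * contain that marker. The marker text is an assumption about the kernel's
 * message format.
 */
long oom_killed_pid(const char *line)
{
    static const char marker[] = "Killed process ";
    const char *p = strstr(line, marker);
    long pid;

    if (p == NULL)
        return -1;
    /* Skip past the marker and parse the decimal PID that follows it. */
    if (sscanf(p + sizeof marker - 1, "%ld", &pid) != 1)
        return -1;
    return pid;
}
```

A framework could feed each line of the log through this function and compare the result with the PIDs it passed to waitpid.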
A server that runs out of memory risks crashing. To keep the server from reaching that critical state, the kernel includes a mechanism known as the OOM killer, which kills non-essential processes so the server can remain operational.
When is the OOM killer invoked? It is invoked when the system is low on memory. When called, it reviews all running processes and kills one or more of them (based on each process's oom_score file) in order to free up memory and keep the system running.
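The oom_score mentioned above is exposed on Linux as /proc/<pid>/oom_score, so a launcher can inspect it for its children. The following is a minimal sketch; the function name read_oom_score and the convention that a pid of 0 means "the calling process" are assumptions made for this example.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read the kernel's current OOM score for a process.
 * pid == 0 is treated as "the calling process" (a convention of this sketch).
 * Returns the score, or -1 if the file cannot be opened or parsed.
 */
long read_oom_score(pid_t pid)
{
    char path[64];
    long score = -1;
    FILE *f;

    if (pid == 0)
        snprintf(path, sizeof path, "/proc/self/oom_score");
    else
        snprintf(path, sizeof path, "/proc/%ld/oom_score", (long)pid);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &score) != 1)
        score = -1;
    fclose(f);
    return score;
}
```

A higher score means the process is a more likely victim when the OOM killer runs.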
The Linux OOM killer works by sending SIGKILL. If your process is killed by the OOM it's fishy that WIFEXITED returns 1.
TLPI: "To kill the selected process, the OOM killer delivers a SIGKILL signal."
So you should be able to test this using:
    if (WIFSIGNALED(status)) {
        if (WTERMSIG(status) == SIGKILL)
            printf("Killed by SIGKILL\n");
    }
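To try the check end to end without exhausting memory, the sketch below forks a child that sends SIGKILL to itself as a stand-in for the OOM killer and then decodes the status returned by waitpid. The names child_outcome, classify_status and spawn_and_wait are hypothetical and exist only for this example. Note that SIGKILL from any source looks the same in the wait status, so the kernel log is still needed to confirm that the OOM killer sent it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical outcome codes for this sketch. */
enum child_outcome {
    CHILD_EXITED_OK,      /* exited normally with status 0 */
    CHILD_EXITED_ERROR,   /* exited normally with a non-zero status */
    CHILD_KILLED_SIGKILL, /* terminated by SIGKILL (possibly the OOM killer) */
    CHILD_KILLED_OTHER,   /* terminated by some other signal */
    CHILD_UNKNOWN
};

/* Decode a wait status as filled in by waitpid(pid, &status, 0). */
enum child_outcome classify_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? CHILD_EXITED_OK : CHILD_EXITED_ERROR;
    if (WIFSIGNALED(status))
        return WTERMSIG(status) == SIGKILL ? CHILD_KILLED_SIGKILL
                                           : CHILD_KILLED_OTHER;
    return CHILD_UNKNOWN;
}

/*
 * Fork a child and wait for it. If kill_self is non-zero, the child sends
 * SIGKILL to itself (simulating the OOM killer); otherwise it exits with
 * exit_code. Stores the raw wait status in *status.
 * Returns 0 on success, -1 if fork or waitpid fails.
 */
int spawn_and_wait(int kill_self, int exit_code, int *status)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (kill_self)
            kill(getpid(), SIGKILL);
        _exit(exit_code);
    }
    if (waitpid(pid, status, 0) < 0)
        return -1;
    return 0;
}
```

If a harness like this reports CHILD_KILLED_SIGKILL but the real framework still sees WIFEXITED, it is worth checking whether the PID being waited on belongs to the benchmark itself or to an intermediate process, such as a shell or wrapper script, that exits normally after its own child was killed.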