
Error in SLURM cluster - Detected 1 oom-kill event(s): how to improve running jobs


I'm working in a SLURM cluster and I was running several processes at the same time (on several input files), and using the same bash script.

At the end of the job, the process was killed and this is the error I obtained.

slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.

My guess is that there is some issue with memory. But how can I find out more? Did I not request enough memory, or, as a user, was I requesting more than I have access to?

Any suggestion?

Asked Sep 20 '18 by CafféSospeso

People also ask

What causes OOM killer?

The Out Of Memory Killer, or OOM Killer, is a mechanism the Linux kernel employs when the system is critically low on memory. This situation occurs because the Linux kernel has over-allocated memory to its processes. When a process starts, it requests a block of memory from the kernel.

Which signal does the OOM killer send to kill the process?

When one or more processes are selected, the OOM Killer calls the oom_kill_task() function. This function is responsible for sending the terminate/kill signal to the process. In an out-of-memory situation, oom_kill() calls this function so that it can send the SIGKILL signal to the process. A kernel log message is generated.
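If you have access to the node where the job ran (or to its logs), a quick generic Linux check for that kernel log message is the sketch below; this is standard Linux tooling, not anything SLURM-specific:

dmesg -T | grep -i 'oom'
# or, on systemd-based systems:
journalctl -k | grep -i 'oom'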

What invokes OOM killer?

The OOM Killer is a special mechanism invoked by the kernel when the system is critically low on memory. This occurs when processes consume a large amount of memory and the system requires more memory for its own processes. When a process starts, it requests a block of memory from the kernel.


2 Answers

Here OOM stands for "Out of Memory". When Linux runs low on memory, it will "oom-kill" a process to keep critical processes running. It looks like slurmstepd detected that your process was oom-killed. Oracle has a nice explanation of this mechanism.

If you had requested more memory than you were allowed, the process would not have been allocated to a node and computation would not have started. It looks like you need to request more memory.
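One way to confirm this (a sketch assuming your cluster keeps job accounting data; the job ID is taken from the error message) is to compare the job's peak memory use against what was requested, using sacct:

sacct -j 1090990 --format=JobID,JobName,ReqMem,MaxRSS,State

If MaxRSS is at or near ReqMem for the killed step, the job exceeded its memory request and you should ask for more.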

Answered Sep 16 '22 by Kyle


The accepted answer is correct but, to be more precise, the error

slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.

indicates that the job ran out of system RAM (CPU memory), as opposed to GPU memory.

If you were, for instance, running a computation on a GPU and requested more GPU memory than is available, you would instead get an error like this one (example from PyTorch):

RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)

Check out the explanation in this article for more details.

Solution: increase the --mem-per-cpu parameter in your script, or add it if it is not set.

1) If you run your script with sbatch (sbatch your_script.sh), add the following line to it:

#SBATCH --mem-per-cpu=<value bigger than you've requested before>
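For context, a minimal batch script might look like the sketch below; the job name, resource values, and script name are placeholders, not recommendations:

#!/bin/bash
#SBATCH --job-name=myjob           # placeholder name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # example value
#SBATCH --mem-per-cpu=4G           # example value; raise this if you still see oom-kill events

python3 your_script.py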

2) If you are using srun (srun python3 your_script.py), add the parameter like this:

srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
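In either case, once the job finishes you can check how close it came to the new limit. If your cluster provides the seff utility (a common SLURM contrib tool, though not installed everywhere), a single command reports the job's memory efficiency:

seff <jobid>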
Answered Sep 17 '22 by kaspiotr