
Error in SLURM cluster - Detected 1 oom-kill event(s): how to improve running jobs


I'm working in a SLURM cluster and I was running several processes at the same time (on several input files), and using the same bash script.

At the end of the job, the process was killed and this is the error I obtained.

slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.

My guess is that there is some issue with memory. But how can I find out more? Did I not request enough memory, or, as a user, was I requesting more than I have access to?

Any suggestion?

Asked Sep 20 '18 by CafféSospeso

People also ask

What causes OOM killer?

The Out Of Memory Killer, or OOM Killer, is a mechanism the Linux kernel employs when the system is critically low on memory. This situation occurs because the Linux kernel has over-allocated memory to its processes. When a process starts, it requests a block of memory from the kernel.

Which signal does the OOM killer send to kill the process?

When one or more processes are selected, the OOM Killer calls the oom_kill_task() function. This function is responsible for sending the terminate/kill signal to the process. In an out-of-memory situation, oom_kill() calls this function so that it can send the SIGKILL signal to the process. A kernel log message is generated.
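If you have access to the node where the job ran (or to its logs), a quick generic Linux check for that kernel log message is the sketch below; this is standard Linux tooling, not anything SLURM-specific:

dmesg -T | grep -i 'oom'
# or, on systemd-based systems:
journalctl -k | grep -i 'oom'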

What invokes OOM killer?

The OOM Killer is a special mechanism invoked by the kernel when the system is critically low on memory. This occurs when processes consume a large amount of memory and the system requires more memory for its own processes. When a process starts, it requests a block of memory from the kernel.


2 Answers

Here OOM stands for "Out of Memory". When Linux runs low on memory, it will "oom-kill" a process to keep critical processes running. It looks like slurmstepd detected that your process was oom-killed. Oracle has a nice explanation of this mechanism.

If you had requested more memory than you were allowed, the process would not have been allocated to a node and computation would not have started. It looks like you need to request more memory.
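One way to confirm this (a sketch assuming your cluster keeps job accounting data; the job ID is taken from the error message) is to compare the job's peak memory use against what was requested, using sacct:

sacct -j 1090990 --format=JobID,JobName,ReqMem,MaxRSS,State

If MaxRSS is at or near ReqMem for the killed step, the job exceeded its memory request and you should ask for more.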

Answered Sep 16 '22 by Kyle


The accepted answer is correct but, to be more precise, the error

slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.

indicates that the job ran out of system RAM (CPU memory), as opposed to GPU memory.

If you were, for instance, running a computation on a GPU and requested more GPU memory than is available, you would instead get an error like this one (example from PyTorch):

RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)

Check out the explanation in this article for more details.

Solution: increase the --mem-per-cpu parameter in your script, or add it if it is not set.

1) If you run your script with sbatch (sbatch your_script.sh), add the following line to it:

#SBATCH --mem-per-cpu=<value bigger than you've requested before>
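For context, a minimal batch script might look like the sketch below; the job name, resource values, and script name are placeholders, not recommendations:

#!/bin/bash
#SBATCH --job-name=myjob           # placeholder name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # example value
#SBATCH --mem-per-cpu=4G           # example value; raise this if you still see oom-kill events

python3 your_script.py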

2) If you are using srun (srun python3 your_script.py), add the parameter like this:

srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
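In either case, once the job finishes you can check how close it came to the new limit. If your cluster provides the seff utility (a common SLURM contrib tool, though not installed everywhere), a single command reports the job's memory efficiency:

seff <jobid>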
Answered Sep 17 '22 by kaspiotr