
parallel but different Slurm srun job step invocations not working

Tags:

slurm

I'd like to run the same program on a large number of different input files. I could submit each file as a separate Slurm job, but I don't want to swamp the queue by dumping thousands of jobs on it at once. Instead, I've been trying to process the same number of files by creating an allocation first, then looping over all the files within that allocation with srun, giving each invocation a single core from the allocation. The problem is that no matter what I do, only one job step runs at a time. The simplest test case I could come up with is:

#!/usr/bin/env bash

srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &

wait

It doesn't matter how many cores I assign to the allocation:

time salloc -n 1 test
time salloc -n 2 test
time salloc -n 4 test

it always takes 4 seconds. Is it not possible to have multiple job steps execute in parallel?
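
The pattern I'm ultimately after looks roughly like this (my_prog and the inputs/ directory are placeholder names, not my real ones):

#!/usr/bin/env bash
# one allocation, many single-core job steps: launch one step per input file
for f in inputs/*; do
    srun --exclusive --ntasks 1 -c 1 ./my_prog "$f" &
done
wait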

Cyclone asked Feb 19 '16


People also ask

What is the difference between Sbatch and Srun?

srun executes in interactive, blocking mode, while sbatch executes in batch, non-blocking mode. srun is mostly used to run jobs immediately, but sbatch can be used to schedule jobs for later execution. Note that if your SSH session is interrupted for any reason, the srun job will be cancelled.
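
A minimal illustration of the difference (hostname here is just a stand-in for real work):

# srun runs the command right away and blocks until it finishes
srun --ntasks=1 hostname

# sbatch queues the work and returns immediately; the job runs
# whenever the scheduler allocates resources for it
sbatch --wrap="hostname"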

How do you use SRUN in Slurm?

After you type an srun command with its options on the command line and press enter, Slurm finds and allocates the resources you specified. Depending on what you requested, it can take a few minutes for Slurm to allocate those resources. You can view all of the srun options on the Slurm documentation website.
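
As an illustrative example (the resource values are arbitrary), the following requests one task with four CPUs for ten minutes and drops into an interactive shell:

srun --nodes=1 --ntasks=1 --cpus-per-task=4 --time=00:10:00 --pty bash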

How many jobs can Slurm handle?

There is no single fixed limit; it depends on the limits configured for the association or QOS, such as MaxJobs (how many jobs may run at once) and MaxSubmitJobs (how many may be submitted). With MaxJobs=20 and MaxSubmitJobs=50 in effect, for example, at most 20 jobs run concurrently and at most 50 can be in the queue at a time.
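
These limits are typically set in the accounting database, for example per user with sacctmgr (the user name and values below are placeholders):

sacctmgr modify user where name=someuser set MaxJobs=20 MaxSubmitJobs=50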

Which Slurm command is used to submit a batch job?

sbatch is the command used to submit a batch job. If you need to run commands interactively inside an allocation instead, you can use the salloc command to get a Slurm job allocation, execute a command (such as srun or a shell script containing srun commands), and then, when the command finishes, enter exit to release the allocated resources.
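
For a non-interactive submission, a minimal batch script could look like this (the resource values are placeholders):

#!/usr/bin/env bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00

srun hostname

Submit it with sbatch, e.g. sbatch job.sh.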


1 Answer

It turned out that the default memory per CPU was not defined, so even single-core job steps were reserving all of the node's RAM. With each step implicitly claiming all the memory, only one step could run at a time.

Setting DefMemPerCPU in slurm.conf, or specifying explicit memory reservations on each step, did the trick.
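
For example, either of the following addresses it (the memory values are illustrative, not taken from the original answer):

# slurm.conf: give each allocated CPU a default memory share so a
# single-core step no longer implicitly claims the whole node's RAM
DefMemPerCPU=2048    # in MB

# or request memory explicitly for each job step
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=1G sleep 1 &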

Cyclone answered Nov 15 '22