Parallel but different Slurm srun job step invocations not working



I'd like to run the same program on a large number of different input files. I could submit each one as a separate Slurm job, but I don't want to swamp the queue by dumping thousands of jobs on it at once. Instead, I've been trying to create a single allocation first and then, within that allocation, loop over all the files with srun, giving each invocation a single core from the allocation. The problem is that no matter what I do, only one job step runs at a time. The simplest test case I could come up with is:

#!/usr/bin/env bash

srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &

# Block until all background job steps have finished
wait

It doesn't matter how many cores I assign the allocation:

time salloc -n 1 test
time salloc -n 2 test
time salloc -n 4 test

it always takes 4 seconds. Is it not possible to have multiple job steps execute in parallel?

1 Answer

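It turned out that the default memory per CPU was not defined, so even single-core job steps were reserving all of the node's RAM. With no memory left for the others, the steps could only run one after another.

Setting DefMemPerCPU in slurm.conf, or giving each srun an explicit memory reservation (for example with --mem-per-cpu), did the trick.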
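For illustration, here is a minimal sketch of the per-step memory reservation applied to the file loop from the question. The function name launch_steps, the MEM_PER_CPU variable, and its 1G default are assumptions made up for this example, not part of Slurm or of the original setup. Choose a memory value that suits the actual program.

```shell
#!/usr/bin/env bash
# Sketch only: launch_steps and MEM_PER_CPU are hypothetical names for this example.

# Assumed per-core memory request; adjust to what the program really needs.
MEM_PER_CPU="${MEM_PER_CPU:-1G}"

# Usage: launch_steps PROGRAM FILE...
# Starts one single-core job step per input file, then waits for all of them.
launch_steps() {
    local prog="$1"
    shift
    local f
    for f in "$@"; do
        # The explicit memory request stops one step from claiming the whole node's RAM,
        # so several steps can run side by side within the allocation.
        srun --exclusive --ntasks 1 -c 1 --mem-per-cpu="$MEM_PER_CPU" "$prog" "$f" &
    done
    # Block until every background job step has finished.
    wait
}
```

Inside a script started with something like salloc -n 4, a call such as launch_steps ./my_program input/*.dat (both names are placeholders) would then queue one step per file, with as many steps running concurrently as the allocation has free cores and memory for.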

