I'd like to run the same program on a large number of different input files. I could just submit each one as a separate Slurm job, but I don't want to swamp the queue by dumping thousands of jobs on it at once. Instead, I've been trying to process the same set of files by creating an allocation first, then looping over all the files within that allocation with srun, giving each invocation a single core from the allocation. The problem is that no matter what I do, only one job step runs at a time. The simplest test case I could come up with is:
#!/usr/bin/env bash
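# Launch four single-task, single-core job steps in the background,
# then wait for all of them to finish.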
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
wait
It doesn't matter how many cores I assign to the allocation:
time salloc -n 1 test
time salloc -n 2 test
time salloc -n 4 test
it always takes 4 seconds. Is it not possible to have multiple job steps execute in parallel?
Answer. srun runs interactively and blocks until the command finishes, while sbatch runs in batch mode and returns immediately. srun is mostly used for jobs you want to run right away, whereas sbatch queues jobs for later execution. Note that if your SSH session is interrupted for any reason, the srun command will be cancelled along with it.
When you run srun from the command line, Slurm finds and allocates the resources you specified; depending on the request, this can take a few minutes. All of the srun options are described on the Slurm documentation website.
If you need this kind of interactive workflow, you can instead use the salloc command to get a Slurm job allocation, execute a command within it (such as srun or a shell script containing srun commands), and then enter exit to release the allocated resources when you are done.
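For the original use case, such a shell script might look roughly like the sketch below. The ./inputs directory, the my_program name, and the throttling logic are assumptions for illustration, not part of the original post; the script relies on SLURM_NTASKS being set inside the allocation.

#!/usr/bin/env bash
# Hypothetical wrapper: run my_program once per input file,
# one core per job step, at most $SLURM_NTASKS steps at a time.
for f in ./inputs/*; do
    srun --exclusive --ntasks 1 -c 1 ./my_program "$f" &
    # Throttle: once all allocated tasks are busy, wait for one
    # background step to finish (wait -n needs bash >= 4.3).
    while (( $(jobs -rp | wc -l) >= SLURM_NTASKS )); do
        wait -n
    done
done
wait   # wait for the remaining steps

Launched with something like salloc -n 16 ./wrapper.sh, this keeps at most 16 steps in flight at any moment without putting thousands of separate jobs in the queue.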
It turned out that the default memory per CPU was not defined, so even single-core job steps were reserving all of the node's RAM, which is why only one step could run at a time.
Setting DefMemPerCPU, or specifying an explicit memory reservation for each step, did the trick.
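As a sketch (the 1G figure is just an assumption about what a single task actually needs), adding an explicit per-step memory request to the test script from the question lets the four steps run side by side:

#!/usr/bin/env bash
# Each step now asks for 1 core and 1 GB, so four steps fit in the
# allocation instead of one step grabbing all of the node's RAM.
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=1G sleep 1 &
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=1G sleep 1 &
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=1G sleep 1 &
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=1G sleep 1 &
wait

With that change, time salloc -n 4 test should drop from roughly 4 seconds to roughly 1. The cluster-wide alternative is for an administrator to set DefMemPerCPU (a value in megabytes) in slurm.conf.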