I am running a job array with SLURM, using the following job array script (which I run with sbatch job_array_script.sh [args]):

#!/bin/bash
#SBATCH ... other options ...
#SBATCH --array=0-1000%200
srun ./job_slurm_script.py $1 $2 $3 $4
echo 'open' > status_file.txt
To explain: I want job_slurm_script.py to be run as an array job 1000 times, with at most 200 tasks in parallel, and when all of those are done I want 'open' written to status_file.txt. This is because in reality I have more than 10,000 jobs, which is above my cluster's MaxSubmissionLimit, so I need to split them into smaller chunks (1000-element job arrays) and run them one after the other, each only when the previous one has finished.

However, for this to work the echo statement must only trigger once the entire job array is finished (outside of this script I have a loop which checks status_file.txt to see whether the job is finished, i.e. whether its contents are the string 'open').
Up to now I thought that srun holds the script up until the whole job array is finished. However, sometimes srun "returns" and the script reaches the echo statement before the jobs are finished, so all the subsequent jobs bounce off the cluster because they exceed the submission limit.

So how do I make srun "hold up" until the whole job array is finished?
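For context, the outer polling loop mentioned above is essentially the following sketch (status_is_open is a hypothetical helper name, not from my actual script, and the real loop sleeps between checks):

```shell
#!/bin/bash
# Sketch of the outer polling loop: wait until status_file.txt contains
# the string 'open' before submitting the next chunk.
# status_is_open is a hypothetical helper name.
status_is_open() {
    [ -f status_file.txt ] && [ "$(cat status_file.txt)" = "open" ]
}

# The real driver would do:  until status_is_open; do sleep 60; done
# Quick demonstration of the helper:
rm -f status_file.txt
if status_is_open; then before=open; else before=closed; fi  # file missing yet
echo 'open' > status_file.txt                                # what the array script writes
if status_is_open; then after=open; else after=closed; fi
rm -f status_file.txt
```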
You can add the flag --wait to sbatch: sbatch then does not exit until the submitted job terminates, and for a job array that means until all of its array tasks have finished. Check the manual page of sbatch for details about --wait.

Note that the srun inside your batch script cannot provide this guarantee: each array task executes its own copy of the script, so that srun only waits for its single task, and the echo line runs as soon as the first task completes, not when the whole array is done.
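With --wait, the status-file polling becomes unnecessary: a driver script can submit one chunk at a time, and each sbatch call blocks until that chunk's array has terminated. A minimal dry-run sketch (job_array_script.sh is from your question; TOTAL and CHUNK are assumed values, and the leading echo only prints each command instead of submitting it):

```shell
#!/bin/bash
# Dry-run sketch of a driver loop: one 1000-task array per sbatch call,
# with the next chunk submitted only after the previous one terminates.
TOTAL=10000   # assumed total number of jobs
CHUNK=1000    # tasks per array, below the cluster's MaxSubmissionLimit

for (( start=0; start<TOTAL; start+=CHUNK )); do
    # Drop the leading 'echo' to actually submit. --wait makes sbatch
    # block until the whole array (all of its tasks) has terminated.
    echo sbatch --wait --array="0-$(( CHUNK - 1 ))%200" job_array_script.sh "$@"
done
```

With this loop in place, the echo 'open' line (and the polling on status_file.txt) can be dropped from job_array_script.sh entirely.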