I have an executable that takes multiple options and multiple file inputs in order to run. The executable can be called with a variable number of cores to run.
E.g. executable -a -b -c --file fileA --file fileB ... --file fileZ --cores X
I'm trying to create an sbatch file that will let me make multiple calls of this executable with different inputs. Each call should be allocated to a different node (in parallel with the rest), using X cores. The parallelization at the core level is taken care of by the executable, while at the node level it is handled by SLURM.
I tried with ntasks and multiple sruns but the first srun was called multiple times.
Another take was to rename the files and use a SLURM process or node number as filename before the extension but it's not really practical.
Any insight on this?
I always do these kinds of jobs with the help of a bash script that I submit with sbatch. The easiest approach would be to have a loop in an sbatch script where you spawn the different job steps for your executable with srun, specifying e.g. the corresponding node name in your partition with -w. You may also read up on the documentation of SLURM array jobs (if that suits you better). Alternatively, you could store all parameter combinations in a file and then loop over them in the script, or have a look at the "array job" manual page.
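As a rough illustration of the array-job route with a parameter file, here is a minimal sketch. It assumes a hypothetical params.txt with one line of executable arguments per run; the file name, the array range 0-2, the 4 CPUs and the helper args_for_task are all made up for illustration and are not taken from your setup. It only prints the srun command (a dry run), so remove the echo once the printed command looks right.

```shell
#!/bin/bash
#SBATCH --array=0-2          # assumed: three parameter lines, one array task each
#SBATCH --nodes=1            # every array task is its own one-node job
#SBATCH --exclusive          # whole nodes, so concurrent tasks do not share a node
#SBATCH --cpus-per-task=4    # assumed value for X

# Hypothetical parameter file: one line of executable arguments per task.
params_file=${PARAMS_FILE:-params.txt}

# Print the argument line for a zero-based task index ($1) from a file ($2).
args_for_task() {
    sed -n "$(( $1 + 1 ))p" "$2"
}

# Outside of SLURM these variables are unset, so fall back to defaults.
task_id=${SLURM_ARRAY_TASK_ID:-0}
cores=${SLURM_CPUS_PER_TASK:-4}

if [ -r "$params_file" ]; then
    task_args=$(args_for_task "$task_id" "$params_file")
    # Dry run: print the command instead of running it.
    # $task_args is intentionally unquoted so it splits into separate arguments.
    echo srun executable $task_args --cores "$cores"
fi
```

With this layout every array task reads its own line, so the inputs do not have to be renamed after a node or process number.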
Maybe the following script for the srun loop (I just put it together) helps you get a feeling for what I have in mind (I hope it's what you need). It's not tested, so don't just copy and paste it!
#!/bin/bash
parameters=(10 5 2)
node_names=(node1 node2 node3)
# let's run one job step per node, each time taking one parameter
for parameter in "${parameters[@]}"
do
    # assign parameter to node
    # script some if/else condition here to specify parameters
    # take the first node from the list
    node=${node_names[0]}
    # delete the first node from the list
    unset 'node_names[0]'
    # re-index the list so the next node sits at index 0 again
    node_names=("${node_names[@]}")
    JOBNAME="myjob-$node-$parameter"
    # -w specifies the name of the node to use
    # -N specifies the number of nodes
    # -p specifies the partition, -J the job name
    srun -N1 -w "$node" -p somepartition -J "$JOBNAME" executable.sh "$parameter" &
done
You will have the problem that you need to force your sbatch script to wait for the last job step. In this case the following additional while loop might help you.
# Wait for the last job step to complete
while true
do
    # wait for the last job to finish, using the state reported by sacct
    echo "waiting for last job to finish"
    sleep 10
    # sacct shows your jobs; -s R,PD keeps only running and pending ones
    # grep for your job name indicator and check its status code
    # (non-zero if nothing was found)
    if ! sacct -s R,PD | grep "myjob"
    then
        echo "found no running jobs anymore"
        echo "stopping loop"
        break
    fi
done
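Since the job steps are started in the background with & from the same script, bash's built-in wait may be a simpler alternative to polling sacct. The sketch below is only meant to show the pattern: run_step and run_all are hypothetical stand-ins (run_step replaces the real srun call so that this can be tried without a cluster), and the parameters 10 5 2 are the ones from the script above.

```shell
#!/bin/bash
# Hypothetical stand-in for "srun ... executable.sh <parameter>".
run_step() {
    sleep 1
    echo "step $1 done"
}

# Start one background step per parameter, then block until all have exited.
run_all() {
    for parameter in 10 5 2; do
        run_step "$parameter" &
    done
    # wait without arguments returns once every background child has finished
    wait
    echo "all steps finished"
}

run_all
```

In the real script you would put a plain wait after the srun loop; without it the batch script can reach its end while the steps are still running, and SLURM then ends the job.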