I have a complex model written in Matlab. The model was not written by us and is best thought of as a "black box" i.e. in order to fix the relevant problems from the inside would require rewritting the entire model which would take years.
If I have an "embarrassingly parallel" problem I can use an array to submit X variations of the same simulation with the option #SBATCH --array=1-X
. However, clusters normally have a (frustratingly small) limit on the maximum array size.
Whilst using a PBS/TORQUE cluster I have got around this problem by forcing Matlab to run on a single thread, requesting multiple CPUs and then running multiple instances of Matlab in the background. An example submission script is:
#!/bin/bash
<OTHER PBS COMMANDS>
#PBS -l nodes=1:ppn=5,walltime=30:00:00
#PBS -t 1-600
<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON ARRAY NUMBER>
# define Matlab options
options="-nodesktop -noFigureWindows -nosplash -singleCompThread"
for sub_job in {1..5}
do
<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON LOOP NUMBER (i.e. sub_job)>
matlab ${options} -r "run_model(${arg1}, ${arg2}, ..., ${argN}); exit" &
done
wait
<TIDY UP AND FINISH COMMANDS>
Can anyone help me do the equivalent on a SLURM cluster?
par
function will not run my model in a parallel loop in Matlab.I am not a big expert on array jobs but I can help you with the inner loop.
I would always use GNU parallel to run several serial processes in parallel, within a single job that has more than one CPU available. It is a simple perl
script, so not difficult to 'install', and its syntax is extremely easy. What it basically does is to run some (nested) loop in parallel. Each iteration of this loop contains a (long) process, like your Matlab command. In contrast to your solution it does not submit all these processes at once, but it runs only N
processes at the same time (where N
is the number of CPUs you have available). As soon as one finishes, the next one is submitted, and so on until your entire loop is finished. It is perfectly fine that not all processes take the same amount of time, as soon as one CPU is freed, another process is started.
Then, what you would like to do is to launch 600 jobs (for which I substitute 3 below, to show the complete behavior), each with 5 CPUs. To do that you could do the following (whereby I have not included the actual run of matlab
, but that trivially can be included):
#!/bin/bash
#SBATCH --job-name example
#SBATCH --out job.slurm.out
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 5
#SBATCH --mem 512
#SBATCH --time 30:00:00
#SBATCH --array 1-3
cmd="echo matlab array=${SLURM_ARRAY_TASK_ID}"
parallel --max-procs=${SLURM_CPUS_PER_TASK} "$cmd,subjob={1}; sleep 30" ::: {1..5}
Submitting this job using:
$ sbatch job.slurm
submits 3 jobs to the queue. For example:
$ squeue | grep tdegeus
3395882_1 debug example tdegeus R 0:01 1 c07
3395882_2 debug example tdegeus R 0:01 1 c07
3395882_3 debug example tdegeus R 0:01 1 c07
Each job gets 5 CPUs. These are exploited by the parallel
command, to run your inner loop in parallel. Once again, the range of this inner loop may be (much) larger than 5, parallel
takes care of the balancing between the 5 available CPUs within this job.
Let's inspect the output:
$ cat job.slurm.out
matlab array=2,subjob=1
matlab array=2,subjob=2
matlab array=2,subjob=3
matlab array=2,subjob=4
matlab array=2,subjob=5
matlab array=1,subjob=1
matlab array=3,subjob=1
matlab array=1,subjob=2
matlab array=1,subjob=3
matlab array=1,subjob=4
matlab array=3,subjob=2
matlab array=3,subjob=3
matlab array=1,subjob=5
matlab array=3,subjob=4
matlab array=3,subjob=5
You can clearly see the 3 times 5 processes run at the same time now (as their output is mixed).
No need in this case to use srun
. SLURM will create 3 jobs. Within each job everything happens on individual compute nodes (i.e. as if you were running on your own system).
To 'install' GNU parallel into your home folder, for example in ~/opt
.
Download the latest GNU Parallel.
Make the directory ~/opt
if it does not yet exist
mkdir $HOME/opt
'Install' GNU Parallel:
tar jxvf parallel-latest.tar.bz2
cd parallel-XXXXXXXX
./configure --prefix=$HOME/opt
make
make install
Add ~/opt
to your path:
export PATH=$HOME/opt/bin:$PATH
(To make it permanent, add that line to your ~/.bashrc
.)
Use conda
.
(Optional) Create a new environment
conda create --name myenv
Load an existing environment:
conda activate myenv
Install GNU parallel:
conda install -c conda-forge parallel
Note that the command is available only when the environment is loaded.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With