I have a bash script, submit.sh, for submitting training jobs to a Slurm server. It works as follows: running
bash submit.sh p1 8 config_file
submits the task corresponding to config_file to 8 GPUs of partition p1. Each node of p1 has 4 GPUs, so this command requests 2 nodes.
The content of submit.sh can be summarized as follows, in which I use sbatch to submit a Slurm script (train.slurm):
#!/bin/bash
# submit.sh
PARTITION=$1
NGPUs=$2
CONFIG=$3
NGPUS_PER_NODE=4
NCPUS_PER_TASK=10
sbatch --partition=${PARTITION} \
       --job-name=${CONFIG} \
       --output=logs/${CONFIG}_%j.log \
       --ntasks=${NGPUs} \
       --ntasks-per-node=${NGPUS_PER_NODE} \
       --cpus-per-task=${NCPUS_PER_TASK} \
       --gres=gpu:${NGPUS_PER_NODE} \
       --hint=nomultithread \
       --time=10:00:00 \
       --export=CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
       train.slurm
Now in the Slurm script, train.slurm, I decide whether to launch the training Python script on one or multiple nodes (the launch command differs between the two cases):
#!/bin/bash
# train.slurm
#SBATCH --distribution=block:block
# Load Python environment
module purge
module load pytorch/py3/1.6.0
set -x
if [ ${NGPUs} -gt ${NGPUS_PER_NODE} ]; then  # Multi-node training
    # Some variables needed for the training script
    export MASTER_PORT=12340
    export WORLD_SIZE=${NGPUs}
    # etc.
    srun python train.py --cfg ${CONFIG}
else  # Single-node training
    python -u -m torch.distributed.launch --nproc_per_node=${NGPUS_PER_NODE} --use_env train.py --cfg ${CONFIG}
fi
Now if I submit on a single node (e.g., bash submit.sh p1 4 config_file), it works as expected. However, submitting on multiple nodes (e.g., bash submit.sh p1 8 config_file) produces the following error:
slurmstepd: error: execve(): python: No such file or directory
This suggests that the Python environment was not found on one of the nodes. I tried replacing python with $(which python) to use the full path to the Python binary in the virtual environment, but then I got another error:
OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory
If I don't use submit.sh but instead add all the #SBATCH options to train.slurm and submit the job with sbatch directly from the command line, it works. It therefore seems that wrapping sbatch inside a bash script causes the issue.
Could you please help me to resolve this?
Thank you so much in advance.
Beware that the --export parameter causes the environment seen by srun to be reset to exactly the SLURM_* variables plus those explicitly listed, in your case CONFIG, NGPUs, and NGPUS_PER_NODE. Consequently, the PATH variable is not set and srun cannot find the python executable.
Note that --export does not alter the environment of the submission script itself, so the single-node case, which does not use srun, runs fine.
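You can check this inside train.slurm by comparing the PATH seen by the batch script with the one the srun tasks see. A minimal diagnostic sketch, assuming it is placed after the module load line (bash is invoked by absolute path since PATH may be reset for the tasks):
echo "batch script PATH: ${PATH}"          # module load has added python here
srun /bin/bash -c 'echo "task PATH: ${PATH}"'  # reset by --export on the task side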
Try submitting with
--export=ALL,CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
Note the added ALL as the first item in the list.
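For reference, the sbatch call in submit.sh then becomes (identical to the original except for the --export line):
sbatch --partition=${PARTITION} \
       --job-name=${CONFIG} \
       --output=logs/${CONFIG}_%j.log \
       --ntasks=${NGPUs} \
       --ntasks-per-node=${NGPUS_PER_NODE} \
       --cpus-per-task=${NCPUS_PER_TASK} \
       --gres=gpu:${NGPUS_PER_NODE} \
       --hint=nomultithread \
       --time=10:00:00 \
       --export=ALL,CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
       train.slurm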
Another option is to remove the --export line entirely and export the variables in submit.sh instead, as Slurm propagates the submission environment to the job by default.
export PARTITION=$1
export NGPUs=$2
export CONFIG=$3
export NGPUS_PER_NODE=4
export NCPUS_PER_TASK=10
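Putting it together, submit.sh would then look like this (a sketch; without --export, sbatch defaults to --export=ALL, so the exported variables reach train.slurm):
#!/bin/bash
# submit.sh (alternative: rely on Slurm's default environment propagation)
export PARTITION=$1
export NGPUs=$2
export CONFIG=$3
export NGPUS_PER_NODE=4
export NCPUS_PER_TASK=10

sbatch --partition=${PARTITION} \
       --job-name=${CONFIG} \
       --output=logs/${CONFIG}_%j.log \
       --ntasks=${NGPUs} \
       --ntasks-per-node=${NGPUS_PER_NODE} \
       --cpus-per-task=${NCPUS_PER_TASK} \
       --gres=gpu:${NGPUS_PER_NODE} \
       --hint=nomultithread \
       --time=10:00:00 \
       train.slurm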