 

Keep slurm array tasks confined in a single node

Tags:

slurm

I want to submit an array job to Slurm with 100 tasks, each using just one CPU. I have access to a cluster with 10 nodes of 24 cores each, with hyperthreading activated. I am limiting the number of concurrent jobs with --array=1-100%24, trying to keep all jobs on a single node and leave the rest of the cluster free for other users, but the 24 tasks are executed on an arbitrary number of nodes. I've tried --nodes=1 and --distribution=block:block to override the cyclic distribution, both unsuccessfully: the 24 simultaneous tasks run on more than one node.

Browsing Stack Overflow, I've seen an older question that solved this by giving a list of nodes to exclude. That works for me, but I think it defeats the purpose of having a job scheduler to optimize cluster usage.
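For reference, that workaround boils down to a single directive; the node names here are hypothetical, not ones from my cluster:

#SBATCH --exclude=node[02-10]   # keep the array off every node except one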

Here's the example script I'm using to try to solve this:

Thanks a lot, Pablo

#!/bin/sh  
#SBATCH --cpus-per-task=1 
#SBATCH --ntasks=1
#SBATCH --output=output/test.log_%A_%a.out
#SBATCH --error=output/test.log_%A_%a.err 
#SBATCH --array=1-100%48
#SBATCH --distribution=block:block
#SBATCH --nodes=1

# Display all variables set by slurm
env | grep "^SLURM" | sort

# Print hostname job executed on.
echo
echo "My hostname is: $(hostname -s)"
echo

sleep 30
Asked Nov 07 '22 by pau


1 Answer

I am assuming that the other users prefer having entire nodes for their jobs too; most of the time, admins actually like job arrays precisely because their one-CPU tasks can fill in the gaps left on partially used nodes.

You can try the --exclusive=user option. That way, Slurm will reserve a full node for the first job of the array to start, and then schedule all the others on the same machine, since only your jobs will be allowed there.
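A minimal sketch of this option, reusing the array parameters from the script above (the directives are illustrative, not a verified configuration):

#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --array=1-100%24
#SBATCH --exclusive=user   # node is shared only with this user's other jobs

echo "Task $SLURM_ARRAY_TASK_ID running on $(hostname -s)"
sleep 30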

Another option is to pack the 24 jobs into a single job with 24 tasks, request --nodes=1 and --ntasks-per-node=24, and use srun within the submission script to run the 24 tasks.
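And a minimal sketch of that packing approach, assuming a hypothetical per-task script task.sh that takes the task index as an argument (on newer Slurm versions, --exact replaces --exclusive for job steps):

#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=1
#SBATCH --output=output/packed_%j.out

# Launch 24 single-CPU job steps in the background; --exclusive keeps
# each step on its own CPU. wait holds the job open until all finish.
for i in $(seq 1 24); do
    srun --ntasks=1 --exclusive ./task.sh "$i" &
done
wait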

Answered Nov 15 '22 by damienfrancois