I have a couple of thousand jobs to run on a SLURM cluster with 16 nodes. The jobs should run only on a subset of 7 of the available nodes. Some of the tasks are parallelized and use all the CPU power of a single node, while others are single-threaded. Therefore, multiple jobs should run at the same time on a single node. None of the tasks should span multiple nodes.
Currently I submit each of the jobs as follows:
sbatch --nodelist=myCluster[10-16] myScript.sh
However, this parameter makes Slurm wait until the submitted job terminates before starting the next one, and hence leaves 3 nodes completely unused; depending on the task (multi- or single-threaded), the currently active node might also be under low CPU load.
What are the best parameters of sbatch that force Slurm to run multiple jobs at the same time on the specified nodes?
There are two ways of submitting a job to SLURM: via a SLURM job script (a bash script that includes directives to the SLURM scheduler), or via command-line options (directives passed to SLURM as command-line arguments).
Use a batch job to receive an allocation of compute resources and have your commands run there. sbatch is the Slurm command to submit a script or a .slurm submission script as a batch job. Here is a simple example, submitting a bash script as a batch job.
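For instance, a minimal sketch (the job name, time limit and commands are illustrative placeholders, not taken from the original post):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00

# print the node the job landed on
echo "Hello from $(hostname)"

Saved as, say, myJob.sh, this would be submitted with sbatch myJob.sh.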
Please note that the hard maximum number of jobs that the SLURM scheduler can handle is 10000. It is best to keep the number of jobs you have submitted at any given time well below half this amount, in case another user also wants to submit a large number of jobs.
--ntasks-per-node=<ntasks> - Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.
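For example, the following illustrative command (script name as in the question, task count assumed) requests a single node and runs 4 tasks on it:

sbatch --nodes=1 --ntasks-per-node=4 myScript.sh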
You can work the other way around: rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use:
sbatch --exclude=myCluster[01-09] myScript.sh
and Slurm will never allocate more than 7 nodes to your jobs. Make sure, though, that the cluster configuration allows node sharing, and that your myScript.sh contains #SBATCH --ntasks=1 --cpus-per-task=n, with n the number of threads of each job.
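As a sketch, myScript.sh for a job with, say, 4 threads could look like this (the executable name is a hypothetical placeholder):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# run the (hypothetical) multi-threaded program on the allocated cores
srun ./my_program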
Some of the tasks are parallelized and use all the CPU power of a single node, while others are single-threaded.
I understand that you want the single-threaded jobs to share a node, whereas the parallel ones should be assigned a whole node exclusively?
multiple jobs should run at the same time on a single node.
As far as my understanding of SLURM goes, this implies that you must define CPU cores as consumable resources (i.e., SelectType=select/cons_res and SelectTypeParameters=CR_Core in slurm.conf).
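That is, slurm.conf would contain lines along these lines (a minimal excerpt; the rest of the configuration is omitted):

SelectType=select/cons_res
SelectTypeParameters=CR_Core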
Then, to constrain parallel jobs to get a whole node you can either use --exclusive
option (but note that partition configuration takes precedence: you can't have shared nodes if the partition is configured for exclusive access), or use -N 1 --tasks-per-node="number_of_cores_in_a_node"
(e.g., -N 1 --ntasks-per-node=8
).
Note that the latter will only work if all nodes have the same number of cores.
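Putting it together, the submissions could look like this (illustrative commands; the node list comes from the question, the core count of 8 from the example above):

# single-threaded job, shares a node with other jobs
sbatch --exclude=myCluster[01-09] -N 1 --ntasks=1 --cpus-per-task=1 myScript.sh

# parallel job, fills a whole node
sbatch --exclude=myCluster[01-09] -N 1 --ntasks-per-node=8 myScript.sh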
None of the tasks should span multiple nodes.
This should be guaranteed by -N 1.