You need to run, say, 30 srun jobs, but ensure that each job runs on a node from a particular list of nodes (nodes with identical performance, so that timings can be compared fairly). How would you do it?
What I tried:
srun --nodelist=machineN[0-3] <some_cmd>
runs <some_cmd> on all the nodes simultaneously (what I need: to run <some_cmd> on one of the available nodes from the list).
srun -p partition
seems to work, but needs a partition that contains exactly machineN[0-3], which is not always the case.
Ideas?
Partitions in Slurm can be thought of as a resource abstraction: a partition definition groups a set of nodes and attaches job limits and access controls to that group.
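For illustration, if you (or the cluster admin) were able to group the four benchmark nodes into their own partition, the relevant slurm.conf entry might look something like this (the partition name 'bench' and the limits are made up for the example):
PartitionName=bench Nodes=machineN[0-3] Default=NO MaxTime=24:00:00 State=UP
Each of the 30 jobs could then simply be submitted with srun -p bench -N1 -n1 <some_cmd>, but as noted in the question, creating such a partition is not always possible.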
Also keep in mind that a single task cannot be split across multiple nodes: requesting resources with --cpus-per-task keeps them on one node, whereas requesting several tasks with --ntasks may spread the job across multiple nodes.
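For example (the counts are arbitrary), the first form below is guaranteed to land on a single node, while the second may be spread over several:
srun --ntasks=1 --cpus-per-task=8 <some_cmd>
srun --ntasks=8 <some_cmd>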
Nodes provide resources such as processors, memory, swap and local disk, and jobs consume these resources. The default exclusive-use policy in Slurm can therefore result in inefficient utilization of the cluster and of its nodes' resources.
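If you want to check what resources a given node actually offers (and how much of them is currently allocated), scontrol can report it; machineN0 here is just one of the nodes from the question:
scontrol show node machineN0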
You can go the opposite direction and use the --exclude option of srun (sbatch accepts it as well):
srun --exclude=machineN[4-XX] <some_cmd>
Slurm will then only consider nodes that are not in the excluded list. If the list is long and complicated, it can be saved in a file.
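Tying this back to the original question, a minimal sketch for the 30 jobs (assuming each one fits on a single node and the jobs should be launched concurrently) could be:
for i in $(seq 1 30); do
    srun -N1 -n1 --exclude=machineN[4-XX] <some_cmd> &
done
wait
Each srun call asks for exactly one node and one task, and the exclusion list guarantees that the node comes from machineN[0-3]; jobs that cannot start immediately simply wait for a node from that set to free up.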
Another option is to check whether the Slurm configuration includes 'features' with
sinfo --format "%20N %20f"
If the 'features' column shows a comma-delimited list of features for each node (these might describe the CPU family, network connection type, etc.), you can select a subset of the nodes with a specific feature using
srun --constraint=<some_feature> <some_cmd>
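For instance, if the four benchmark nodes all carried a feature named, say, bench (a hypothetical name; use whatever sinfo actually reports), each of the 30 jobs could be restricted to that group with:
srun -N1 -n1 --constraint=bench <some_cmd>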