When I use sinfo
I see the following:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
[...]
RG3 up 28-00:00:0 1 drain rg3hpc4
[...]
What does the state 'drain' mean?
In Kubernetes, by comparison, node draining is the mechanism that allows users to gracefully move all containers from one node to the others. There are multiple use cases: server maintenance; autoscaling of the k8s cluster (nodes are added and removed dynamically); and preemptible or spot instances that can be terminated at any time.
From the DESCRIPTION section of the sinfo manpage: sinfo is used to view partition and node information for a system running Slurm.
Check that compatible versions of Slurm exist on all of the nodes (execute "sinfo -V" or "rpm -qa | grep slurm"). The Slurm version number contains three period-separated numbers that represent both the major Slurm release and maintenance release level.
It means no further jobs will be scheduled on that node, but the currently running jobs will keep running (by contrast with setting the node down, which kills all jobs running on the node).
Nodes are often set to that state so that some maintenance operation can take place once all running jobs are finished.
From the manpage of the scontrol command:
If you want to remove a node from service, you typically want to set its state to "DRAIN"
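For illustration, here is roughly what that looks like on the administrator's side. This is only a sketch: it prints the scontrol commands instead of running them, so it can be tried without admin rights or a cluster. The helper names drain_cmd and resume_cmd and the reason text are made up for the example; the node name is the one from the sinfo output in the question.

```shell
# Dry-run sketch: build the scontrol commands an admin would typically use.
# drain_cmd and resume_cmd are hypothetical helper names, not Slurm tools.
NODE=rg3hpc4                   # node name taken from the question
REASON="planned maintenance"   # hypothetical reason text

# Drain: no new jobs are scheduled on the node, running jobs continue.
drain_cmd() {
    printf 'scontrol update NodeName=%s State=DRAIN Reason="%s"\n' "$1" "$2"
}

# Resume: return the node to service once maintenance is done.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

drain_cmd "$NODE" "$REASON"
resume_cmd "$NODE"
```

Running the printed commands for real normally requires Slurm administrator privileges.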
Note that the system administrator most probably gave a reason why the node is drained, and you can see that reason with
sinfo -R
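If you just want the names of the drained nodes, you can filter the regular sinfo output. The snippet below is a minimal sketch, assuming the default six-column layout shown in the question; drained_nodes is a hypothetical helper name. It reads from stdin, so it can be tried on the sample lines from the question without access to a cluster.

```shell
# Print the NODELIST column for nodes whose STATE starts with "drain"
# or "drng" (assumed here to cover drained and draining nodes in both
# the short and long state spellings).
drained_nodes() {
    # NR > 1 skips the header line; column 5 is STATE, column 6 is NODELIST.
    awk 'NR > 1 && $5 ~ /^(drain|drng)/ { print $6 }'
}

# Sample input copied from the question.
printf '%s\n' \
    'PARTITION AVAIL TIMELIMIT NODES STATE NODELIST' \
    'RG3 up 28-00:00:0 1 drain rg3hpc4' | drained_nodes
```

On a real system you would pipe sinfo into it instead, i.e. `sinfo | drained_nodes`.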