Using sinfo, I see that 3 nodes are in the drain state:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*      up    infinite      3 drain node[10,11,12]
Which command line should I use to undrain such nodes?
It means that no further jobs will be scheduled on that node, but the currently running jobs will keep running (by contrast with setting the node down, which kills all jobs running on the node).
Note: drain means that the node is up but is not accepting new jobs. If jobs were still running on the node it would say drng (draining): the running jobs would be allowed to complete, at which point the node would enter the drain state. A node is "drained" by the cluster administrators for maintenance or updates.
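To see why a node was drained, sinfo itself can print the recorded reason (standard sinfo flags, no extra configuration assumed):
sinfo -R        # lists down/drained nodes together with the Reason recorded by the admin or slurmd
sinfo -N -l     # long per-node view, includes STATE and REASON columns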
If slurmctld is not running, restart it (typically as user root, using the command "/etc/init.d/slurm start").
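On clusters that use systemd rather than the old init script, the equivalent would typically be something like the following (unit names follow the usual Slurm packaging and may differ on your system):
sudo systemctl restart slurmctld    # controller daemon, on the head node
sudo systemctl status slurmctld     # check that it actually came up
sudo systemctl restart slurmd       # compute-node daemon, on the drained node itself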
Found an approach: enter the scontrol interpreter (on the command line, type scontrol) and then run
scontrol: update NodeName=node10 State=DOWN Reason="undraining"
scontrol: update NodeName=node10 State=RESUME
Then
scontrol: show node node10
displays, amongst other info,
State=IDLE
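The same can be done non-interactively from the shell, and scontrol accepts Slurm hostlist syntax, so the three drained nodes from the question could be resumed in one go (a sketch using the same node names):
scontrol update NodeName=node10 State=RESUME          # single node
scontrol update NodeName=node[10-12] State=RESUME     # node10, node11 and node12 at once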
Update: some of these nodes got the DRAIN state back; I noticed their root partition was full after e.g. show node a10, which showed Reason=SlurmdSpoolDir is full. So, on Ubuntu, I ran sudo apt-get clean to remove the /var/cache/apt contents and also gzipped some /var/log files.
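Before resuming such a node again it can help to check where SlurmdSpoolDir points and how much space is free; a possible sequence on Ubuntu (paths are the common packaging defaults and may differ on your cluster):
scontrol show config | grep -i SlurmdSpoolDir    # typically /var/spool/slurmd
df -h /                                          # free space on the root partition
sudo du -sh /var/cache/apt /var/log              # the two space hogs mentioned above
sudo apt-get clean                               # drops the downloaded package cache under /var/cache/apt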
If no jobs are currently running on the node:
scontrol update nodename=node10 state=idle
If jobs are running on the node:
scontrol update nodename=node10 state=resume
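Whichever variant is used, the result can be verified afterwards, e.g. (node name from the question):
scontrol show node node10 | grep -o "State=[A-Z+]*"
sinfo -n node10 -o "%n %t %E"    # node name, short state, reason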