How to "undrain" slurm nodes in drain state



sinfo shows that 3 nodes are in the drain state:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      3  drain node[10,11,12]

Which command line should I use to undrain such nodes?

2 Answers

One approach that worked: start the scontrol interpreter (type scontrol on the command line) and then run

scontrol: update NodeName=node10 State=DOWN Reason="undraining"
scontrol: update NodeName=node10 State=RESUME


scontrol: show node node10 

displays, among other information, the node's current State.


Update: some of these nodes went back into the DRAIN state. Running e.g. show node a10 showed Reason=SlurmdSpoolDir is full, and their root partition was indeed full. On Ubuntu I ran sudo apt-get clean to clear the package cache under /var/cache/apt, and also gzipped some /var/log files.
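
Since a node with an unresolved problem will simply drain again, it can help to read the drain reason before resuming it. A minimal sketch, assuming scontrol show node prints the reason on its own line in the form Reason=... (as in the message quoted above); drain_reason is a hypothetical helper name, not a Slurm command:

```shell
# drain_reason: hypothetical helper, not part of Slurm.
# Reads `scontrol show node <name>` output on stdin and prints the text
# that follows "Reason=", if such a line is present.
drain_reason() {
  sed -n 's/^[[:space:]]*Reason=//p'
}

# Usage on a cluster (needs Slurm installed):
#   scontrol show node node10 | drain_reason
#   df -h /    # check whether the root partition is full
```

sinfo -R also lists the reason recorded for each down or drained node, which is quicker when several nodes are affected.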

If no jobs are currently running on the node:

scontrol update nodename=node10 state=idle 

If jobs are running on the node:

scontrol update nodename=node10 state=resume 
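
With several drained nodes, as in the question, the same update can be applied to all of them in one call, because scontrol accepts a node range expression for NodeName. A minimal sketch; undrain is a hypothetical wrapper, and by default it only prints the command. Setting DRY_RUN=0 (an assumed convention of this sketch) makes it actually run scontrol, which normally requires root or the Slurm administrator account:

```shell
# undrain: hypothetical wrapper, not part of Slurm.
# Prints the scontrol command for the given node list; runs it only
# when DRY_RUN=0 is set in the environment.
undrain() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    scontrol update "NodeName=$1" State=RESUME
  else
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
  fi
}

# Dry run for the three nodes from the question: prints the command only.
undrain 'node[10-12]'
```

Printing first makes it easy to check the node list before changing any state; DRY_RUN=0 undrain 'node[10-12]' then performs the update.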
