
How to "undrain" slurm nodes in drain state

Tags:

slurm

Running sinfo shows that 3 nodes are in the drain state:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      3  drain node[10,11,12]

Which command line should I use to undrain such nodes?

elm asked Apr 09 '15 09:04


People also ask

What is drain state in slurm?

It means no further jobs will be scheduled on that node, but the currently running jobs will keep running (in contrast to setting the node down, which kills all jobs running on it).

Why is node in drain state?

Note: drain means that the node is up but is not accepting new jobs. If jobs were running on the node it would say drng: the running jobs would be allowed to complete at which time the node would enter the drain state. A node is “drained” by the cluster administrators for maintenance or updates.

How do I reset slurm node?

If slurmctld is not running, restart it (typically as user root using the command "/etc/init.d/slurm start").


2 Answers

Found an approach: enter the scontrol interpreter (type scontrol at the command line) and then run

scontrol: update NodeName=node10 State=DOWN Reason="undraining"
scontrol: update NodeName=node10 State=RESUME

Then

scontrol: show node node10 

displays amongst other info

State=IDLE 
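The same resume step can also be run without entering the interpreter, and for several nodes at once. The following is a minimal sketch, not a definitive recipe: the function name undrain_nodes and the SCONTROL override variable are made up for illustration (SCONTROL is not a Slurm setting, it only allows a dry run), and it assumes scontrol is on the PATH and that you have the privileges needed to update node state.

```shell
#!/bin/sh
# Sketch: resume drained nodes non-interactively.
# SCONTROL is a hypothetical override used only for dry runs
# (e.g. SCONTROL=echo); it is not a Slurm setting.

# undrain_nodes NODELIST
# NODELIST may be a single node (node10) or a hostlist
# expression covering several nodes (node[10-12]).
undrain_nodes() {
    if [ "$#" -ne 1 ] || [ -z "$1" ]; then
        echo "usage: undrain_nodes NODELIST" >&2
        return 2
    fi
    # The quotes keep the shell from treating [10-12] as a glob pattern.
    "${SCONTROL:-scontrol}" update NodeName="$1" State=RESUME
}
```

For the three nodes in the question this would be called as undrain_nodes 'node[10-12]', followed by sinfo to confirm the state changed. Setting SCONTROL=echo first prints the command instead of running it.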

Update: some of these nodes went back to the DRAIN state. Running e.g. show node a10 showed Reason=SlurmdSpoolDir is full, and their root partition was indeed full. On Ubuntu, I ran sudo apt-get clean to remove the /var/cache/apt contents and also gzipped some /var/log files.
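Since a full disk can send a resumed node straight back to DRAIN, it may help to check free space before resuming. This is a rough sketch under stated assumptions: the helper names disk_usage_pct and check_spool are made up for illustration, the 90 percent threshold is arbitrary, and /var/spool/slurmd is only assumed as the spool directory (the actual value is whatever SlurmdSpoolDir is set to in your slurm.conf).

```shell
#!/bin/sh
# disk_usage_pct DIR
# Print the integer use% of the filesystem that holds DIR.
disk_usage_pct() {
    # -P forces the POSIX output format: one line per filesystem,
    # with the capacity percentage in column 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_spool [DIR] [LIMIT]
# Warn when the filesystem holding DIR is at or above LIMIT percent.
# Both defaults are assumptions; adjust them to your site.
check_spool() {
    dir="${1:-/var/spool/slurmd}"
    limit="${2:-90}"
    pct=$(disk_usage_pct "$dir")
    # An empty result means df could not report on DIR.
    [ -n "$pct" ] || return 2
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $dir filesystem is ${pct}% full"
        return 1
    fi
    echo "OK: $dir filesystem is ${pct}% full"
}
```

Run on the affected node, check_spool with no arguments checks the assumed spool directory; check_spool / 95 would check the root partition against a 95 percent limit. On the controller side, sinfo -R lists the recorded reason for each drained or down node.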

elm answered Oct 01 '22 23:10


If no jobs are currently running on the node:

scontrol update nodename=node10 state=idle 

If jobs are running on the node:

scontrol update nodename=node10 state=resume 

irritable_phd_syndrome answered Oct 01 '22 23:10