Using sinfo, I see that 3 nodes are in the drain state:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*      up    infinite      3 drain node[10,11,12]
Which command line should I use to undrain such nodes?
It means that no further jobs will be scheduled on that node, but the currently running jobs will keep running (by contrast with setting the node down, which kills all jobs running on the node).
Note: drain means that the node is up but is not accepting new jobs. If jobs were still running on the node it would say drng (draining): the running jobs would be allowed to complete, at which point the node would enter the drain state. A node is "drained" by the cluster administrators for maintenance or updates.
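To see why a node was drained, sinfo itself can print the recorded reason (standard sinfo flags, no extra configuration assumed):
sinfo -R        # lists down/drained nodes together with the Reason recorded by the admin or slurmd
sinfo -N -l     # long per-node view, includes STATE and REASON columns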
If slurmctld is not running, restart it (typically as user root, using the command "/etc/init.d/slurm start").
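On clusters that use systemd rather than the old init script, the equivalent would typically be something like the following (unit names follow the usual Slurm packaging and may differ on your system):
sudo systemctl restart slurmctld    # controller daemon, on the head node
sudo systemctl status slurmctld     # check that it actually came up
sudo systemctl restart slurmd       # compute-node daemon, on the drained node itself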
Found an approach: enter the scontrol interpreter (on the command line, type scontrol) and then run
scontrol: update NodeName=node10 State=DOWN Reason="undraining"
scontrol: update NodeName=node10 State=RESUME
Then
scontrol: show node node10
displays, amongst other info,
State=IDLE
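The same can be done non-interactively from the shell, and scontrol accepts Slurm hostlist syntax, so the three drained nodes from the question could be resumed in one go (a sketch using the same node names):
scontrol update NodeName=node10 State=RESUME          # single node
scontrol update NodeName=node[10-12] State=RESUME     # node10, node11 and node12 at once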
Update: some of these nodes got the DRAIN state back; I noticed their root partition was full after e.g. show node a10, which showed Reason=SlurmdSpoolDir is full. So, on Ubuntu, I ran sudo apt-get clean to remove the /var/cache/apt contents and also gzipped some /var/log files.
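Before resuming such a node again it can help to check where SlurmdSpoolDir points and how much space is free; a possible sequence on Ubuntu (paths are the common packaging defaults and may differ on your cluster):
scontrol show config | grep -i SlurmdSpoolDir    # typically /var/spool/slurmd
df -h /                                          # free space on the root partition
sudo du -sh /var/cache/apt /var/log              # the two space hogs mentioned above
sudo apt-get clean                               # drops the downloaded package cache under /var/cache/apt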
If no jobs are currently running on the node:
scontrol update nodename=node10 state=idle
If jobs are running on the node:
scontrol update nodename=node10 state=resume
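Whichever variant is used, the result can be verified afterwards, e.g. (node name from the question):
scontrol show node node10 | grep -o "State=[A-Z+]*"
sinfo -n node10 -o "%n %t %E"    # node name, short state, reason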