Is it possible to run Slurm jobs in the background using srun instead of sbatch?

I was trying to run Slurm jobs with srun in the background. Unfortunately, because I have to run things through Docker right now, it's a bit annoying to use sbatch, so I am trying to find out if I can avoid it altogether.

From my observations, whenever I run srun, say:

srun docker image my_job_script.py

and close the window where I was running the command (to avoid receiving all the print statements), then open another terminal window to see if the command is still running, it seems that my running script gets cancelled for some reason. Since the job isn't submitted through sbatch, it doesn't write an error log file (as far as I know), so I have no idea why it was killed.

I also tried:

srun docker image my_job_script.py &

to get control of my terminal back. Unfortunately, if I do that, it still keeps printing output to my terminal screen, which I am trying to avoid.

Essentially, I log into a remote computer through SSH and then run an srun command, but it seems that if my SSH connection is terminated, the srun command is automatically killed. Is there a way to stop this?

Ideally, I would like to send the script to run and not have it be cancelled for any reason unless I cancel it with scancel, and it should not print to my screen. So my ideal solution is:

  1. keep running my srun script even if I log out of the SSH session
  2. keep running my srun script even if I close the window from which I sent the command
  3. keep running my srun script without it printing to my screen (i.e. essentially run it in the background)

This would be my ideal solution.


For the curious crowd who want to know the issue with sbatch: what I want to be able to do (the ideal solution) is:

sbatch docker image my_job_script.py

however, as people will know, that does not work, because sbatch receives the command docker, which isn't a "batch" script. Essentially, a simple solution (that doesn't really work for my case) would be to wrap the docker command in a batch script:

#!/bin/sh
docker image my_job_script.py
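
Such a wrapper would then be submitted with sbatch itself, e.g. (assuming it is saved as my_job_wrapper.sh; the name is just a placeholder):

sbatch my_job_wrapper.sh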

Unfortunately, I am actually using my batch script to encode a lot of information (sort of like a config file) about the task I am running. So editing that file might affect jobs that are already queued, because their underlying file would be changing. That is normally avoided by sending the job directly to sbatch, since sbatch essentially creates a copy of the batch script (as noted in this question: Changing the bash script sent to sbatch in slurm during run a bad idea?).

So the real solution to my problem would be to have my batch script contain all the information that my script requires, and then somehow call docker from Python while passing it all that information. Unfortunately, some of the information consists of function pointers and objects, so it's not even clear to me how I would pass such things to a docker command run from Python.
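
As a purely hypothetical sketch of that last point (the file name and --config flag are placeholders, not from the question), the launcher could pickle its objects to a file on the host and bind-mount it into the container:

# hypothetical: job_config.pkl holds the pickled objects written by the launcher
docker run -v "$PWD/job_config.pkl":/job/job_config.pkl my_image \
    python my_job_script.py --config /job/job_config.pkl

Note, though, that functions only pickle by reference to module-level names, so arbitrary function pointers would still need special handling.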


Alternatively, being able to pass the docker command directly to sbatch, instead of using a batch script, would also solve the problem.

asked Feb 10 '17 by Charlie Parker


People also ask

What is the difference between sbatch and srun?

The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
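
A minimal illustration of that difference (the script name is a placeholder):

srun hostname          # blocks; output appears in this terminal
sbatch my_wrapper.sh   # returns immediately; output goes to slurm-<jobid>.out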

How do you use srun in Slurm?

After typing your srun command and options on the command line and pressing enter, Slurm will find and then allocate the resources you specified. Depending on what you specified, it can take a few minutes for Slurm to allocate those resources. You can view all of the srun options on the Slurm documentation website.
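
For example, a minimal invocation with a few common options (the resource values are arbitrary):

srun --nodes=1 --ntasks=1 --time=00:10:00 ./my_program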

What does srun hostname do?

srun is the command used to run a process on the compute nodes in the cluster. It is passed a command (which could be a script) that is run on a compute node, and srun returns when it completes. srun accepts many command-line options to specify the resources required by the command passed to it.
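
For example (the task count is arbitrary):

srun --ntasks=4 hostname    # prints the hostname of the node each task ran on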

What is the srun command?

srun is a means of synchronously submitting a single command to run in parallel on a new or existing allocation. It is inherently synchronous because it attempts to launch tasks on an allocated resource, waits (blocks) until these resources are available, and returns only when the tasks have completed.
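
For example, to run against an existing allocation (a sketch; the values are arbitrary):

salloc --ntasks=2    # obtain an allocation and start a shell inside it
srun hostname        # launches on the allocated resources and blocks until done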


1 Answer

The output can be redirected with the -o option for stdout and the -e option for stderr.

So the job can be launched in the background with its output redirected:

$ srun -o file.out -e file.err docker image my_job_script.py &
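
Note that a plain & still leaves the job tied to the login shell, so it can die when the terminal or SSH session closes. One way around that (a sketch, not part of the original answer) is to additionally detach the command with nohup:

$ nohup srun -o file.out -e file.err docker image my_job_script.py &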
answered Sep 21 '22 by Bub Espinja