I have a simple test.ksh that I am running with the command:
sbatch test.ksh
I keep getting "JobState=FAILED Reason=NonZeroExitCode" (using "scontrol show job")
I have already confirmed the following:
Slurm processes are not run under a shell, but directly exec'ed by the slurmd daemon (assuming srun is used to launch the processes).
You use the sbatch command with a shell script to specify the resources your job needs, such as the number of nodes to run on and how much memory is required. Slurm then schedules the job based on the availability of the resources you've specified.
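For reference, a minimal job script might look like the sketch below. The shebang, job name, and resource values are illustrative assumptions, not taken from the original test.ksh; sbatch honors the shebang line, so a ksh script works as long as ksh is installed on the compute nodes.

```shell
#!/bin/ksh
#SBATCH --job-name=test        # job name shown by squeue/scontrol
#SBATCH --nodes=1              # number of nodes (assumed value)
#SBATCH --ntasks=1             # number of tasks (assumed value)
#SBATCH --mem=1G               # memory per node (assumed value)
#SBATCH --time=00:05:00        # wall-clock limit (assumed value)

print "hello from $(hostname)"
```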
Nodes possess resources such as processors, memory, swap, local disk, etc., and jobs consume these resources. Slurm's default exclusive-use policy can result in inefficient utilization of the cluster and of its nodes' resources.
srun is a means of synchronously submitting a single command to run in parallel on a new or existing allocation. It is inherently synchronous because it attempts to launch tasks on an allocated resource, waits (blocks) until these resources are available, and returns only when the tasks have completed.
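As an illustration, the blocking behavior can be seen with a sequence like the following (hypothetical commands; they require access to a Slurm cluster, and the task count is an assumed value):

```shell
# Obtain an interactive allocation of 4 tasks, then launch a
# command on it; srun returns only once all 4 tasks have finished.
salloc --ntasks=4
srun hostname    # runs 4 copies of hostname, blocks until all complete
exit             # release the allocation
```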
I found out that I hadn't set --error and --output, which meant the default output file (slurm-&lt;jobid&gt;.out) was written to the directory from which I issued the sbatch command.
The problem was that I didn't have sufficient privileges to write to the current directory.
The solution was to point --error and --output at a directory where I had write permission.
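A quick way to rule this problem out before submitting is to check write permission on the intended log directory. This is a sketch; the directory path is an assumed example:

```shell
#!/bin/sh
# Verify that the directory passed to --output/--error is writable,
# so the job does not fail at startup with NonZeroExitCode.
LOGDIR="${HOME}/slurm-logs"   # assumed log location
mkdir -p "$LOGDIR"            # create it if missing
if [ -w "$LOGDIR" ]; then
    echo "ok: $LOGDIR is writable"
else
    echo "error: $LOGDIR is not writable" >&2
    exit 1
fi
```

The job would then be submitted with something like `sbatch --output="$LOGDIR/%j.out" --error="$LOGDIR/%j.err" test.ksh`, where the `%j` pattern expands to the job ID.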