
Why do I keep getting NonZeroExitCode when using sbatch SLURM?

I have a simple test.ksh that I am running with the command:

sbatch test.ksh

I keep getting "JobState=FAILED Reason=NonZeroExitCode" (using "scontrol show job")

I have already made sure of the following:

  1. slurmd and slurmctld are up and running correctly
  2. file permissions on "test.ksh" are 777.
  3. The command "srun test.ksh" (by itself, without using sbatch) succeeds without problems
  4. I tried putting in a "return 0" in the last line of "test.ksh" without luck
  5. I tried putting in an "exit 0" in the last line of "test.ksh" without luck
  6. I tried putting in "hostname" in the last line of "test.ksh" without luck
  7. I tried putting in "srun hostname" in the last line of "test.ksh" without luck
user3200387 asked Jan 22 '15 16:01




1 Answer

I found out that I hadn't set --error and --output, so sbatch fell back to its default: writing the job's output file (slurm-<jobid>.out) into the directory from which I was issuing the command.

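The problem was that I didn't have write permission in that directory, so the batch job could not create its output file. That presumably also explains why "srun test.ksh" worked: srun sends output back to the terminal rather than to a file.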

The solution was to point --error and --output at files in a directory where I had write permission.
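
A minimal sketch of that fix, in case it helps someone else. The helper names pick_log_dir and sbatch_cmd are hypothetical (they are not part of Slurm), and the candidate directories are only an assumption about where you might have write access. The --output and --error options and the %j (job ID) pattern are standard sbatch features.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate directory that exists
# and is writable by the current user; fail if none qualifies.
pick_log_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: print (not run) an sbatch command line that sends
# stdout and stderr to files in the given directory. %j is expanded by
# Slurm to the job ID.
sbatch_cmd() {
    log_dir=$1
    script=$2
    printf '%s\n' "sbatch --output=$log_dir/slurm-%j.out --error=$log_dir/slurm-%j.err $script"
}

# Example (assumed candidates: current directory, home, temp directory):
#   dir=$(pick_log_dir "$PWD" "$HOME" "${TMPDIR:-/tmp}") && sbatch_cmd "$dir" test.ksh
```

The same options can also go in the script itself as "#SBATCH --output=..." and "#SBATCH --error=..." lines near the top, so they don't have to be repeated on every submission.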

user3200387 answered Oct 11 '22 10:10