
qsub returns error when submitting jobs from node

I have a complex Fortran MPI application running under a Torque/Maui system. When I run it in one go, it produces a single huge output file (~20 GB). To avoid that, I wrote a RunJob script that splits the run into 5 pieces, each producing a smaller output that is much easier to handle.

At the moment my RunJob script stops correctly at the end of the first piece and produces the correct output. However, when it tries to restart, I get the following error message:

qsub: Bad UID for job execution MSG=ruserok failed validating username/username from compute-0-0.local

I know that this problem comes from the fact that the Torque/Maui system by default does not allow a compute node to submit a job.

In fact, when I type this:

qmgr -c 'l s' | grep allow_node_submit

I get:

allow_node_submit = False

I do not have an administrator account, just a regular user account.

My questions are:

  1. Is it possible to set allow_node_submit = True in qmgr as a regular user? How? (I guess not.)
  2. If the answer to question 1 is no, is there another way to work around this? How?

All the best.

Quim asked May 17 '26 00:05



1 Answer

No, an unprivileged user can't change the settings of the queuing system. The usual reason for not allowing resubmission from the compute nodes is a pretty good one: it protects the cluster and all of its users from someone accidentally (or otherwise) submitting a script which fails quickly and resubmits itself once (or, much worse, more than once), quickly flooding the scheduler and queue. That is the batch-queue equivalent of a fork bomb. Even with such restrictions we've had people accidentally submit tens of thousands of jobs at once due to scripting errors.

The usual workaround is to ssh to one of the queue submission nodes and submit the script from there, e.g. at the end of your submission script:

ssh queue-head-node qsub /path/to/new/submission/script

This is how we suggest our users handle it, e.g. here. That obviously will only work if you have password/passphrase-less ssh enabled within the cluster, which is a common (but not universal) practice.
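Given the flooding risk described above, it may be worth capping how many times a job can resubmit itself. The following is only a sketch under stated assumptions: `resubmit_next`, `PIECE`, and `SUBMIT_HOST` are hypothetical names, `queue-head-node` is the placeholder host from the example above, and passwordless ssh to that host is assumed to work. It relies on the Torque `qsub -v` option to pass a counter to the next job's environment.

```shell
#!/bin/bash
# Sketch: resubmit the next piece via ssh, but never beyond a fixed maximum.
# All names here (resubmit_next, PIECE, SUBMIT_HOST) are hypothetical.

# Usage: resubmit_next CURRENT_PIECE MAX_PIECES SCRIPT_PATH
resubmit_next() {
    local piece=$1 max=$2 script=$3

    # Hard stop: without this guard a failing script could resubmit forever.
    if [ "$piece" -ge "$max" ]; then
        echo "piece $piece was the last one; not resubmitting"
        return 0
    fi

    local next=$((piece + 1))

    # qsub -v exports PIECE into the environment of the next job.
    ssh "${SUBMIT_HOST:-queue-head-node}" qsub -v PIECE="$next" "$script"
}
```

At the end of the submission script this could be called as `resubmit_next "${PIECE:-1}" 5 /path/to/new/submission/script`, so the first job (submitted without `PIECE` set) counts as piece 1 and the chain stops after piece 5.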

Alternatively, if this is the common case of automatically submitting a series of jobs which continue a run, check how job dependencies are handled at your site and submit a convoy of jobs up front, each dependent on the successful completion of the previous one, so that they run in order.
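A convoy like this might be submitted as sketched below. This assumes your site accepts Torque's `-W depend=afterok:<jobid>` dependency syntax and that `qsub` prints the new job's ID on standard output; `submit_chain` and the script names are hypothetical. It has to be run from a host that is allowed to submit jobs, not from a compute node.

```shell
#!/bin/bash
# Sketch: submit a chain of jobs, each held until the previous one succeeds.
# submit_chain and the script names used with it are hypothetical.

# Usage: submit_chain SCRIPT [SCRIPT ...]
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job in the chain has no dependency.
            jobid=$(qsub "$script") || return 1
        else
            # afterok: start only if the previous job exited without error.
            jobid=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "submitted $script as $jobid"
        prev=$jobid
    done
}
```

For the five-piece run in the question, a call could look like `submit_chain piece1.pbs piece2.pbs piece3.pbs piece4.pbs piece5.pbs`. Because everything is submitted once from the login node, no job ever needs to call `qsub` from a compute node.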

Jonathan Dursi answered May 20 '26 14:05
