
SGE - QSUB fails to submit jobs in -sync mode

I have a perl script that prepares files for input to a binary program and submits the execution of the binary program to the SGE queueing system version 6.2u2.

The jobs are submitted with the -sync y option so that the parent Perl script can monitor the status of the submitted jobs with the waitpid function.

This is also very useful because sending a SIGTERM to the parent Perl script propagates the signal to each of its children, which then forward it to qsub, gracefully terminating all associated submitted jobs.

Thus, it is fairly crucial that I be able to submit jobs with this -sync y option.

Unfortunately, I keep getting the following error:

Unable to initialize environment because of error: range_list containes no elements

Notice the improper spelling of 'containes'. That is NOT a typo. It just shows you how poorly maintained this area of the code/error message must be.

The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files *.e{JOBID} and *.o{JOBID}. The submission just completely fails.

Searching Google for this error message turns up only unresolved posts on obscure message boards.

This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily even generate the error. It also seems not to matter from which node I attempt to submit jobs.

My hope is that someone here can figure this out.

Answers to any of these questions would thus solve my problem:

  1. Does this error persist in more recent versions of SGE?
  2. Can I alter my command line options for qsub to avoid this?
  3. What the hell is this error message talking about?
EMiller asked Feb 03 '11 06:02




1 Answer

Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.

It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa github "open-core" repo. I later saw the issue mentioned in the Son Of Gridengine v8.0.0c Release Notes.

Here are the related commits in the github repo:

  • https://github.com/gridengine/gridengine/commit/b449607972614e4608272d8c0fc6f109d35fbecc
  • https://github.com/gridengine/gridengine/commit/a47c32f965111554ec076db1526a6ad62c5bdae5

What the error message should say is that you've hit the limit on the number of qsub -sync y jobs in the system. This limit is set by the MAX_DYN_EC parameter. The default in our version was 99, and the changes above increase that default to 1000.

The definition of MAX_DYN_EC (from the sge_conf(5) man page) is:

Sets the max number of dynamic event clients (as used by qsub -sync y and by Grid Engine DRMAA API library sessions). The default is set to 99. The number of dynamic event clients should not be bigger than half of the number of file descriptors the system has. The number of file descriptors are shared among the connections to all exec hosts, all event clients, and file handles that the qmaster needs.

You can check how many dynamic event clients you are using with the following command:

$ qconf -secl | grep qsub | wc -l
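
If you want to compare that count against the limit from a script, something like the following might work. It is only a sketch: it assumes the `qconf -secl` listing puts the client name in the second column, and the helper names `count_qsub_clients` and `below_limit` are made up for illustration, not part of SGE.

```shell
# Count dynamic event clients registered by qsub.
# Reads `qconf -secl` output on stdin.
# Assumption: the client name is the second whitespace-separated column.
count_qsub_clients() {
    awk '$2 == "qsub" { n++ } END { print n + 0 }'
}

# Exit 0 when the current count is still below the given limit.
# $1 = current count, $2 = limit (for example your MAX_DYN_EC value).
below_limit() {
    [ "$1" -lt "$2" ]
}

# Usage on a submit host (requires SGE; 99 is the default quoted above):
#   n=$(qconf -secl | count_qsub_clients)
#   below_limit "$n" 99 || echo "at or over the limit: $n qsub clients" >&2
```

Matching the name column exactly avoids counting a line only because a host name happens to contain the string qsub, which the plain grep would do.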

We have added MAX_DYN_EC=1000 to qmaster_params via qconf -mconf. I've tested submitting hundreds of qsub -sync y jobs and we no longer hit the range_list error. Prior to the MAX_DYN_EC change, doing so would reliably trigger the error.
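
To confirm the setting took effect, one option is to pull the value back out of `qconf -sconf`. The helper below is a rough sketch with a hypothetical name (`max_dyn_ec`); it assumes the global configuration prints a `qmaster_params` line whose entries are separated by commas or whitespace.

```shell
# Print the MAX_DYN_EC value from the qmaster_params line of
# `qconf -sconf` output (read on stdin); print nothing if it is not set.
# Assumption: entries on that line are separated by commas or whitespace.
max_dyn_ec() {
    awk '$1 == "qmaster_params" {
        for (i = 2; i <= NF; i++) {
            n = split($i, params, ",")
            for (j = 1; j <= n; j++) {
                if (params[j] ~ /^MAX_DYN_EC=/) {
                    sub(/^MAX_DYN_EC=/, "", params[j])
                    print params[j]
                }
            }
        }
    }'
}

# Usage on a host with SGE installed:
#   qconf -sconf | max_dyn_ec
```

An empty result would suggest the parameter is not set explicitly, in which case the built-in default for your version applies.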

bdobbie answered Jan 02 '23 21:01