I have a perl script that prepares files for input to a binary program and submits the execution of the binary program to the SGE queueing system version 6.2u2.
The jobs are submitted with the -sync y option so that the parent perl script can monitor the status of the submitted jobs with the waitpid function.
This is also very useful because sending a SIGTERM to the parent perl script propagates the signal to each of the children, which then forward it on to qsub, thus gracefully terminating all associated submitted jobs.
Thus, it is fairly crucial that I be able to submit jobs with this -sync y option.
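The pattern described above can be sketched roughly in POSIX shell. This is not the actual Perl script, only an illustration of the same idea: each job is started in the background with qsub -sync y, a trap forwards SIGTERM to the children, and wait collects their exit statuses. The function name run_all and the SUBMIT override variable are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Hypothetical sketch: submit each job script synchronously in the
# background, forward termination signals, and wait for all of them.
# SUBMIT is an assumed override so the pattern can be tried without SGE.
run_all() {
    pids=""
    # On SIGTERM/SIGINT, pass the signal on to every child we started.
    trap 'kill $pids 2>/dev/null' TERM INT
    for job in "$@"; do
        # Unquoted on purpose so "qsub -sync y" splits into words.
        ${SUBMIT:-qsub -sync y} "$job" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        # wait returns the child's exit status; remember any failure.
        wait "$pid" || status=1
    done
    trap - TERM INT
    return $status
}

# Example (needs a working SGE installation):
#   run_all job1.sh job2.sh
```

With -sync y, qsub stays in the foreground until the job finishes, which is what makes the wait step meaningful here.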
Unfortunately, I keep getting the following error:
Unable to initialize environment because of error: range_list containes no elements
Notice the improper spelling of 'containes'. That is NOT a typo. It just shows you how poorly maintained this area of the code/error message must be.
The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files, *.o{JOBID} and *.e{JOBID}. The submission just fails completely.
Searching Google for this error message only turns up unresolved posts on obscure message boards.
This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily even generate the error. It also seems not to matter from which node I attempt to submit jobs.
My hope is that someone here can figure this out.
Answers to any of these questions would thus solve my problem:
Scheduler, queues and slots: SGE includes both a scheduler for allocating resources (CPUs) to computational jobs and a queuing mechanism. Each queue is associated with a number of slots; one computational process runs in each slot, and each compute node in the HPC cluster provides one or more slots.
Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.
It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa github "open-core" repo. I later saw the issue mentioned in the Son Of Gridengine v8.0.0c Release Notes.
Here are the related commits in the github repo:
What the error message should say is that you've hit the limit on the number of qsub -sync y jobs in the system. This limit is set by the MAX_DYN_EC parameter. The default in our version was 99, and the changes above increase that default to 1000.
The definition of MAX_DYN_EC (from the sge_conf(5) man page) is:
Sets the max number of dynamic event clients (as used by qsub -sync y and by Grid Engine DRMAA API library sessions). The default is set to 99. The number of dynamic event clients should not be bigger than half of the number of file descriptors the system has. The number of file descriptors are shared among the connections to all exec hosts, all event clients, and file handles that the qmaster needs.
You can check how many dynamic event clients you are using with the following command:
$ qconf -secl | grep qsub | wc -l
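If you want to act on that count before submitting, a small wrapper along these lines might help. It is only a sketch: the function names count_qsub_clients and has_sync_headroom are made up for illustration, and it assumes the qconf -secl listing has one line per event client with the client name (qsub) somewhere on the line.

```shell
#!/bin/sh
# Hypothetical helper: count qsub event clients in `qconf -secl`
# output read from stdin. grep -c exits non-zero when the count is 0,
# so "|| true" keeps the function from reporting a failure.
count_qsub_clients() {
    grep -c 'qsub' || true
}

# Hypothetical helper: succeed only if the current number of qsub
# event clients (read from stdin) is below the given limit.
has_sync_headroom() {
    limit=$1
    current=$(count_qsub_clients)
    [ "$current" -lt "$limit" ]
}

# Example (needs a working SGE installation; 99 was our old default):
#   qconf -secl | has_sync_headroom 99 && qsub -sync y job.sh
```

This check is racy, since other users can submit between the count and your qsub, so treat it as a hint rather than a guarantee.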
We have added MAX_DYN_EC=1000 to qmaster_params via qconf -mconf. I've tested submitting hundreds of qsub -sync y jobs and we no longer hit the range_list error. Prior to the MAX_DYN_EC change, doing so would reliably trigger the error.
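To confirm the setting took effect, you can look at the qmaster_params line of the global configuration. The snippet below is a sketch under assumptions: the function name max_dyn_ec_from_conf is invented for this example, and it assumes that qconf -sconf prints qmaster_params and its value on a single line (a wrapped line would need extra handling).

```shell
#!/bin/sh
# Hypothetical helper: read configuration text on stdin and print the
# MAX_DYN_EC value from the qmaster_params line, or nothing if unset.
max_dyn_ec_from_conf() {
    sed -n 's/^qmaster_params.*MAX_DYN_EC=\([0-9][0-9]*\).*/\1/p'
}

# Example (needs a working SGE installation):
#   qconf -sconf | max_dyn_ec_from_conf
```

An empty result would suggest the parameter is not set explicitly, in which case the built-in default for your version applies.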