Wait for all jobs of a user to finish before submitting subsequent jobs to a PBS cluster

Tags: shell, wait, cluster-computing, qsub, pbs

I am trying to adjust some bash scripts to make them run on a (pbs) cluster.

The individual tasks are performed by several scripts that are started by a main script. So far this main script starts multiple scripts in the background (by appending &), making them run in parallel on one multi-core machine. I want to replace these calls with qsubs to distribute the load across the cluster nodes.

However, some jobs depend on others to be finished before they can start. So far, this was achieved by wait statements in the main script. But what is the best way to do this using the grid engine?

I already found this question as well as the -W after:jobid[:jobid...] documentation in the qsub man page, but I hope there is a better way. We are talking about several thousand jobs to run in parallel first and another set of the same size to run simultaneously after the last one of these has finished. This would mean I had to queue a lot of jobs depending on a lot of jobs.

I could bring this down by using a dummy job in between, doing nothing but depending on the first group of jobs, on which the second group could depend. This would decrease the number of dependencies from millions to thousands, but still: it feels wrong, and I am not even sure whether such a long command line would be accepted by the shell.

Isn't there a way to wait for all my jobs to finish (something like qwait -u <user>)?
Or all jobs that were submitted from this script (something like qwait [-p <PID>])?

Of course it would be possible to write something like this using qstat and sleep in a while loop, but I guess this use case is important enough to have a built-in solution that I was just unable to find.
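
For reference, such a polling loop could look roughly like the sketch below. It is only an illustration: qwait_user is a made-up helper name, not an existing PBS command, and it assumes that qstat -u <user> prints nothing once the user has no jobs left in the queue.

```shell
#!/bin/sh
# qwait_user: hypothetical helper (not part of TORQUE/PBS).
# Blocks until `qstat -u USER` prints nothing, i.e. no jobs of USER are listed.
# Usage: qwait_user [user] [poll_interval_seconds]
qwait_user() {
    user=${1:-$(id -un)}
    interval=${2:-30}
    # ASSUMPTION: qstat prints a header plus one line per job,
    # and nothing at all when the user has no jobs.
    while [ -n "$(qstat -u "$user" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Example (commented out so that sourcing this file has no side effects):
# qwait_user "$USER" 60 && echo "all jobs finished"
```

Note that this waits for every job of the user, not only those submitted by the calling script, and that it also returns if qstat itself fails, because a failed call produces no output on stdout.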

What would you recommend / use in such a situation?

Addendum I:

Since it was requested in a comment:

$ qsub --version
version: 2.4.8

Maybe also helpful to determine the exact pbs system:

$ qsub --help
usage: qsub [-a date_time] [-A account_string] [-b secs]
      [-c [ none | { enabled | periodic | shutdown |
      depth=<int> | dir=<path> | interval=<minutes>}... ]
      [-C directive_prefix] [-d path] [-D path]
      [-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]
      [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]
      [-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path
      [-W otherattributes=value...] [-v variable_list] [-V] [-x] [-X] [-z] [script]

Since the comments so far point to job arrays, I searched the qsub man page, with the following results:

[...]
DESCRIPTION
[...]
       In addition to the above, the following environment variables will be available to the batch job.
[...]
       PBS_ARRAYID
              each member of a job array is assigned a unique identifier (see -t)
[...]
OPTIONS
[...]
       -t array_request
               Specifies the task ids of a job array. Single task arrays are allowed.
               The array_request argument is an integer id or a range of integers. Multiple ids or id ranges can be combined in a comman delimeted list. Examples : -t 1-100 or -t 1,10,50-100
[...]

Addendum II:

I have tried the torque solution given by Dmitri Chubarov but it does not work as described.

Without the job array it works as expected:

testuser@headnode ~ $ qsub -W depend=afterok:`qsub ./test1.sh` ./test2 && qstat
2553.testserver.domain
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2552.testserver         Test1            testuser               0 Q testqueue
2553.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2552.testserver         Test1            testuser               0 R testqueue
2553.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2553.testserver         Test2            testuser               0 R testqueue

However, using job arrays the second job won't start:

testuser@headnode ~ $ qsub -W depend=afterok:`qsub -t 1-2 ./test1.sh` ./test2 && qstat
2555.testserver.domain
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2554-1.testserver       Test1-1          testuser               0 Q testqueue
2554-2.testserver       Test1-1          testuser               0 Q testqueue
2555.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2554-1.testserver       Test1-1          testuser               0 R testqueue
2554-2.testserver       Test1-2          testuser               0 R testqueue
2555.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2555.testserver         Test2            testuser               0 H testqueue

I guess this is due to the lack of array indication in the job id that is returned by the first qsub:

testuser@headnode ~ $ qsub -t 1-2 ./test1.sh
2556.testserver.domain

As you can see, there is no ...[] indicating that this is a job array. Also, in the qstat output there are no ...[]s but ...-1 and ...-2 indicating the array.

So the remaining question is how to format -W depend=afterok:... to make a job depend on a specified job array.


asked Aug 26 '13 10:08

mschilli

1 Answer

Filling in the details of the solution suggested by Jonathan in the comments.

There are several resource managers based on the original Portable Batch System: OpenPBS, TORQUE and PBS Professional. The systems have diverged significantly and use different command syntax for newer features such as job arrays.

Job arrays are a convenient way to submit multiple similar jobs based on the same job script. Quoting from the manual:

Sometimes users will want to submit large numbers of jobs based on the same job script. Rather than using a script to repeatedly call qsub, a feature known as job arrays now exists to allow the creation of multiple jobs with one qsub command.

To submit a job array PBS provides the following syntax:

 qsub -t 0-10,13,15 script.sh

This submits jobs with the array ids 0, 1, 2, ..., 10, 13 and 15.

Within the script the variable PBS_ARRAYID carries the id of the job within the array and can be used to pick the necessary configuration.
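
As a rough sketch of how PBS_ARRAYID can select the configuration, the job script below maps the array id to a line of a list file. The file name inputs.txt, the INPUT_LIST variable and the helper nth_line are assumptions made for this illustration only.

```shell
#!/bin/sh
# Sketch of a job script body for `qsub -t 1-N script.sh`.
# ASSUMPTION: inputs.txt (hypothetical) lists one input file per line,
# and array id k processes line k, so ids must start at 1.

# Print line number $1 of file $2.
nth_line() {
    sed -n "${1}p" "$2"
}

# PBS_ARRAYID is set by TORQUE inside array jobs. When it is unset
# (for example when the script is run by hand) nothing is done.
if [ -n "${PBS_ARRAYID:-}" ]; then
    input=$(nth_line "$PBS_ARRAYID" "${INPUT_LIST:-inputs.txt}")
    echo "array id $PBS_ARRAYID processes $input"
fi
```

With a range starting at 0, as in the example above, the id would have to be shifted by one before the lookup.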

Job arrays have their own dependency options.

TORQUE

TORQUE is probably the resource manager used in the question. It provides additional dependency options for arrays, as the following example shows:

$ qsub -t 1-1000 script.sh
1234[].pbsserver.domainname
$ qsub -t 1001-2000 -W depend=afterokarray:1234[] script.sh
1235[].pbsserver.domainname

This will result in the following qstat output:

1234[]         script.sh    user          0 R queue
1235[]         script.sh    user          0 H queue

Tested on TORQUE version 3.0.4.

The full afterokarray syntax is in the qsub(1) manual.
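
In a main script the two submissions can be chained by capturing the id printed by the first qsub. The following is a minimal sketch, assuming TORQUE 2.5.3 or later (where qsub -t prints an id of the form 1234[].pbsserver.domainname); submit_two_stage is a made-up helper name.

```shell
#!/bin/sh
# submit_two_stage: hypothetical helper. Submits STAGE1 as a job array, then
# STAGE2 as a second array that is held until all tasks of the first array
# have exited successfully.
# Usage: submit_two_stage RANGE1 STAGE1 RANGE2 STAGE2
submit_two_stage() {
    range1=$1 stage1=$2 range2=$3 stage2=$4
    first=$(qsub -t "$range1" "$stage1") || return 1
    # Keep only the part before the first dot, e.g. "1234[]",
    # which is the form used in the transcript above.
    array_id=${first%%.*}
    qsub -t "$range2" -W "depend=afterokarray:$array_id" "$stage2"
}

# Example (commented out so that sourcing this file submits nothing):
# submit_two_stage 1-1000 script.sh 1001-2000 script.sh
```

The function prints the id of the second array, so further stages can be chained in the same way.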

PBS Professional

In PBS Professional, dependencies work uniformly on ordinary jobs and array jobs. Here is an example:

$ qsub -J 1-1000 -ry script.sh
1234[].pbsserver.domainname
$ qsub -J 1001-2000 -ry -W depend=afterok:1234[] script.sh
1235[].pbsserver.domainname

This will result in the following qstat output:

1234[]         script.sh    user          0 B queue
1235[]         script.sh    user          0 H queue

Update on Torque versions

Array dependencies have been available in TORQUE since version 2.5.3. Job arrays from version 2.5 are not compatible with job arrays in versions 2.3 or 2.4. In particular, the [] syntax was introduced in version 2.5.
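
A script can check the installed version before relying on afterokarray. The sketch below assumes that qsub --version prints a line of the form "version: X.Y.Z", as shown in the question, and that all version fields are purely numeric; version_ge and has_array_depend are made-up helper names.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A >= B, compared numerically
# field by field (missing fields count as 0).
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*} y=${b%%.*}
        x=${x:-0} y=${y:-0}
        [ "$x" -gt "$y" ] && return 0
        [ "$x" -lt "$y" ] && return 1
        # Drop the field that was just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# has_array_depend: hypothetical helper; succeeds if the local qsub reports
# at least version 2.5.3.
# ASSUMPTION: the output format "version: X.Y.Z" seen in the question.
has_array_depend() {
    v=$(qsub --version 2>&1 | sed -n 's/^version: *//p')
    [ -n "$v" ] && version_ge "$v" 2.5.3
}
```

For the 2.4.8 installation from the question this check fails, which is consistent with the missing [] in the job ids shown there.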

Update on using a delimiter job

For TORQUE versions prior to 2.5, a different solution may work that is based on submitting dummy delimiter jobs between the batches of jobs to be separated. It uses three dependency types: on, before and after.

Consider the following example:

 $ DELIM=`qsub -Wdepend=on:1000 dummy.sh `
 $ qsub -Wdepend=beforeany:$DELIM script.sh
 1001.pbsserver.domainname
 ... another 998 jobs ...
 $ qsub -Wdepend=beforeany:$DELIM script.sh
 2000.pbsserver.domainname
 $ qsub -Wdepend=after:$DELIM script.sh
 2001.pbsserver.domainname
 ...

This will result in a queue state like this:

1000         dummy.sh    user          0 H queue
1001         script.sh   user          0 R queue   
...
2000         script.sh   user          0 R queue   
2001         script.sh   user          0 H queue
...

That is, job #2001 will run only after the previous 1000 jobs have terminated. The rudimentary job array facilities available in TORQUE 2.4 can probably be used as well to submit the script jobs.

This solution will also work for TORQUE version 2.5 and higher.
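
Scripted, the delimiter approach could look like the sketch below. It follows the transcript above; submit_with_delimiter is a made-up helper name and the script names in the usage line are placeholders.

```shell
#!/bin/sh
# submit_with_delimiter: hypothetical helper following the transcript above.
# Submits a dummy delimiter job, COUNT copies of STAGE1 that count down the
# delimiter, and one STAGE2 job that waits for the delimiter.
# Usage: submit_with_delimiter COUNT DUMMY STAGE1 STAGE2
submit_with_delimiter() {
    count=$1 dummy=$2 stage1=$3 stage2=$4
    # The delimiter is held until COUNT "before" dependencies are satisfied.
    delim=$(qsub -W "depend=on:$count" "$dummy") || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each first-stage job satisfies one of those dependencies when it ends.
        qsub -W "depend=beforeany:$delim" "$stage1" >/dev/null || return 1
        i=$((i + 1))
    done
    # The second stage may start only after the delimiter has started.
    qsub -W "depend=after:$delim" "$stage2"
}

# Example (commented out so that sourcing this file submits nothing):
# submit_with_delimiter 1000 dummy.sh script.sh script.sh
```

Only one dependency is attached to each job, so the command lines stay short regardless of the batch size.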


answered Nov 15 '22 04:11

Dmitri Chubarov
