
"embarrassingly parallel" programming using python and PBS on a cluster

I have a function (a neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python, using PBS on a standard cluster with Torque.

Note: I tried parallelpython, ipython and the like and was never completely satisfied, since I want something simpler. The cluster has a fixed configuration that I cannot change, and a solution integrating python + qsub will certainly benefit the community.

To simplify things, I have a simple function such as:

import myModule
import pylab

def model(input, a=1., N=100):
    do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')

where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last hours.

My question is: is there a correct way to transform a parameter scan such as

for a in pylab.linspace(0., 1., 100):
    model(input, a)

into "something" that would launch a PBS script for every call to the model function?

#PBS -l ncpus=1
#PBS -l mem=1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py

I was thinking of a function that would include the PBS template and call it from the python script, but have not yet figured out how to do it (a decorator?).
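For what it's worth, a minimal sketch of that idea (not a decorator, just a plain helper; it assumes qsub is on the PATH of the submitting machine and that experiment_model.py accepts the value of a on its command line) could look like:

import os
import subprocess
import pylab

# the PBS template above, with the parameter spliced in at submission time
PBS_TEMPLATE = '''#PBS -l ncpus=1
#PBS -l mem=1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py %f
'''

def submit(a):
    # write a one-off job script for this parameter value and hand it to qsub
    script_name = 'experiment_model_%f.job' % a
    with open(script_name, 'w') as scriptf:
        scriptf.write(PBS_TEMPLATE % a)
    subprocess.call(['qsub', script_name])
    os.remove(script_name)

for a in pylab.linspace(0., 1., 100):
    submit(a)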

asked Jul 22 '10 by meduz

2 Answers

pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do:

import pbs, os
import pylab

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

attropl = pbs.new_attropl(4)

attropl[0].name = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = '1000mb'

attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attropl[3].name = pbs.ATTR_V

script = '''
cd /data/work/
python experiment_model.py %f
'''

jobs = []

for a in pylab.linspace(0., 1., 100):
    # write a one-off job script with the parameter value substituted in
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    # submit it and keep track of the job id
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

print jobs

[1] pbs_python: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage
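For the above to work, experiment_model.py just needs to pick a up from its command line. A minimal sketch (the module name mymodel and the way the input object is obtained are placeholders for your own code):

# experiment_model.py -- hypothetical worker script
import sys
from mymodel import model, input   # placeholder: import your model() and input object

a = float(sys.argv[1])   # the %f value substituted into the generated job script
model(input, a)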

answered by macedoine

You can do this easily using jug (which I developed for a similar setup).

You'd write, in a file (e.g., model.py):

from jug import TaskGenerator
import numpy as np

@TaskGenerator
def model(param1, param2):
    # complex_computation and pyplot.coolgraph stand in for your real code
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)


for param1 in np.linspace(0., 1., 100):
    for param2 in xrange(2000):
        model(param1, param2)

And that's it!

Now you can launch "jug jobs" on your queue: jug execute model.py, and this will parallelise automatically. What happens is that each job will, in a loop, do something like:

while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me(): t.run()

(It's actually more complicated than that, but you get the point).

It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
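To hook this up to Torque, one possibility (a sketch, assuming jug is installed on the compute nodes and /data/work/ sits on a shared filesystem so the file-based locks are visible to every worker) is to qsub the same small worker script several times; each worker runs jug execute and they share the work through those locks:

#PBS -l ncpus=1
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
jug execute model.py

Submitting that script, say, 20 times gives 20 workers that keep pulling tasks until everything is done.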

This is not exactly what you asked for, but I believe it's a cleaner architecture to separate this from the job queueing system.

answered by luispedro