Using IPython Parallel on the Sun Grid Engine

I'm trying to use IPython Parallel for a very common scenario, where I want to run simulations on a cluster running Sun Grid Engine, and I can't find a reliable way to do this.

Here's what I am trying to do:

I want to run numerical simulations (using Numpy arrays) with several different parameter values -- the tasks are embarrassingly parallel. I have access (through ssh) to the head node of a cluster running Grid Engine. Until now, I have been submitting shell scripts with the qsub command, but this is quite clumsy (handling node crashes, etc.), and I am looking for a way to do all of this in Python.

IPython seems ideally suited for this scenario, but it's turning out to be cumbersome to get the setup working smoothly. I start n (say 20) engines using ipcluster on the head node, and then copy the .json connection files to my local machine, from where I connect using IPython.parallel.Client.
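For reference, the connection step on my local machine looks roughly like this (the connection-file path and the ssh user/host below are just placeholders):

    from IPython.parallel import Client

    # connection file copied over from the head node's
    # ~/.ipython/profile_sge/security/ directory
    rc = Client('/path/to/ipcontroller-client.json',
                sshserver='user@headnode')
    print(rc.ids)   # ids of the engines that have registered so far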

I have set IPClusterStart.controller_launcher_class = 'SGEControllerLauncher' and IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'.
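Concretely, the relevant lines in the profile's ipcluster_config.py look like this (the engine count is just an example of how I set it):

    # ~/.ipython/profile_sge/ipcluster_config.py
    c = get_config()

    c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
    c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
    c.IPClusterEngines.n = 20   # number of engines to request from SGE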

ipcluster seems to be running fine; I get this output in the ssh terminal on the head node:

-- [IPClusterStart] Starting Controller with SGEControllerLauncher
-- [IPClusterStart] Job submitted with job id: '143396'
-- [IPClusterStart] Starting 4 Engines with SGEEngineSetLauncher
-- [IPClusterStart] Job submitted with job id: '143397'
-- [IPClusterStart] Engines appear to have started successfully

However, I have these issues:

  1. Very often, many of the engines fail to register with the controller, even after I see the message above saying that the engines have started successfully. When I start ipcluster with 20 engines, only 10-15 engines show up in the Grid Engine queue. I have no idea what happens to the other engines -- there are no output files. Of the 10-15 engines that do start, only some register with the controller; in their output files I see this:

    ... [IPEngineApp] Using existing profile dir: .../.ipython/profile_sge'
    ... [IPEngineApp] Loading url_file ... .ipython/profile_sge/security/ipcontroller-engine.json'
    ... [IPEngineApp] Registering with controller at tcp://192.168.87.106:63615
    ... [IPEngineApp] Using existing profile dir: .../.ipython/profile_sge'
    ... [IPEngineApp] Completed registration with id 0
    

    On others I see this:

    ... [IPEngineApp] Using existing profile dir: .../.ipython/profile_sge'
    ... [IPEngineApp] Loading url_file .../.ipython/profile_sge/security/ipcontroller-engine.json'
    ... [IPEngineApp] Registering with controller at tcp://192.168.87.115:64909
    ... [IPEngineApp] Registration timed out after 2.0 seconds
    

    Any idea why this happens?

  2. Sometimes the engines start and register successfully, but they begin dying when I have them run something very simple, like view.execute('%pylab') (a minimal sketch of this call is given after the list), and the only exception I get back is this:

    [Engine Exception]
    Traceback (most recent call last):
      File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 708, in _handle_stranded_msgs
        raise error.EngineError("Engine %r died while running task %r"%(eid, msg_id))
    EngineError: Engine 1 died while running task 'b9601e8a-cff5-4037-b9d9-a0b93ca2f256'

  3. Starting the engines this way means that I occupy the nodes and the queue for as long as the engines are running, even if they aren't executing anything. Is there an easy way to start the engines so that they are spawned only when I want to run a script, and shut down once they return the results of their computation?

  4. Grid Engine seems to start the controller on a different node every time, so the connection-file reuse option (--reuse) of ipcluster is not useful; I have to copy the JSON files every time I use ipcluster. Is there a way to avoid this?
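For completeness, the call in issue 2 that makes the engines die is nothing more exotic than this (a minimal sketch; the connection-file path and ssh host are placeholders):

    from IPython.parallel import Client

    rc = Client('/path/to/ipcontroller-client.json',
                sshserver='user@headnode')
    view = rc[:]                        # DirectView over all registered engines
    view.execute('%pylab', block=True)  # engines start dying on calls like this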

It would be really helpful if someone could give a simple workflow for this common scenario: using IPython Parallel to submit embarrassingly parallel jobs to an SGE cluster over an SSH connection. There should be some way of handling resubmission after engine crashes, and it would also be nice if there were a way to use the cluster resources only for the duration of the simulation.
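In other words, the kind of workflow I am hoping for is roughly this (a rough sketch; run_simulation and the parameter values are placeholders for my actual simulation code):

    from IPython.parallel import Client

    def run_simulation(params):
        # heavy Numpy work for one parameter set (placeholder computation)
        import numpy as np
        return np.random.rand(3).sum()

    rc = Client('/path/to/ipcontroller-client.json',
                sshserver='user@headnode')
    lview = rc.load_balanced_view()

    param_values = [dict(a=a) for a in range(100)]   # placeholder parameters
    results = lview.map(run_simulation, param_values, block=True)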

asked by KartMan
1 Answer

This comes a little late, and it doesn't actually answer your specific question, but have you tried pythongrid?

answered by GermanK