 

Writing a parallel programming framework, what have I missed?

Clarification: As per some of the comments, I should clarify that this is intended as a simple framework to allow execution of programs that are naturally parallel (so-called embarrassingly parallel programs). It isn't, and never will be, a solution for tasks which require communication or synchronisation between processes.

I've been looking for a simple process-based parallel programming environment in Python that can execute a function on multiple CPUs on a cluster; the major criterion is that it must be able to execute unmodified Python code. The closest I found was Parallel Python, but pp does some pretty funky things, which can cause the code not to be executed in the correct context (with the appropriate modules imported, etc.).

I finally got tired of searching, so I decided to write my own. What I came up with is actually quite simple. The problem is, I'm not sure if what I've come up with is simple because I've failed to think of a lot of things. Here's what my program does:

  • I have a job server which hands out jobs to nodes in the cluster.
  • The jobs are handed out to servers listening on nodes by passing a dictionary that looks like this:

    {
        'moduleName': 'some_module',
        'funcName': 'someFunction',
        'localVars': {'someVar': someVal, ...},
        'globalVars': {'someOtherVar': someOtherVal, ...},
        'modulePath': '/a/path/to/a/directory',
        'customPathHasPriority': aBoolean,
        'args': (arg1, arg2, ...),
        'kwargs': {'kw1': val1, 'kw2': val2, ...}
    }
    
  • moduleName and funcName are mandatory, and the others are optional.

  • A node server takes this dictionary and does the following (a rough end-to-end sketch of both servers follows this list):

    import sys
    sys.path.append(modulePath)
    # note: __import__ takes globals before locals in its signature
    globals()[moduleName] = __import__(moduleName, globalVars, localVars)
    returnVal = globals()[moduleName].__dict__[funcName](*args, **kwargs)
    
  • On getting the return value, the node server sends it back to the job server, which puts it into a thread-safe queue.

  • When the last job returns, the job server writes the output to a file and quits.
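
For concreteness, here is a very rough sketch of how I imagine the two halves fitting together, using the standard library's multiprocessing.connection module. The names (run_job, dispatch, NODE_PORT) and the wire protocol are placeholders, not the actual code:

    # node_server.py -- sketch of one node server handling the job dictionary
    import sys
    from multiprocessing.connection import Listener

    NODE_PORT = 6000  # placeholder port

    def run_job(job):
        """Import the requested module and call the requested function."""
        if 'modulePath' in job:
            if job.get('customPathHasPriority'):
                sys.path.insert(0, job['modulePath'])
            else:
                sys.path.append(job['modulePath'])
        module = __import__(job['moduleName'],
                            job.get('globalVars'), job.get('localVars'))
        func = getattr(module, job['funcName'])
        return func(*job.get('args', ()), **job.get('kwargs', {}))

    def serve():
        listener = Listener(('0.0.0.0', NODE_PORT), authkey=b'secret')
        while True:
            conn = listener.accept()
            try:
                conn.send(run_job(conn.recv()))  # execute and send result back
            finally:
                conn.close()

    # job_server.py -- sketch of the dispatching side: one thread per job puts
    # results on a thread-safe queue; the main thread writes them to a file
    import threading
    from queue import Queue
    from multiprocessing.connection import Client

    def dispatch(node, job, results):
        conn = Client((node, NODE_PORT), authkey=b'secret')
        try:
            conn.send(job)
            results.put(conn.recv())
        finally:
            conn.close()

    def run_all(jobs, nodes, outfile):
        results = Queue()
        threads = []
        for i, job in enumerate(jobs):
            t = threading.Thread(target=dispatch,
                                 args=(nodes[i % len(nodes)], job, results))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
        with open(outfile, 'w') as out:
            while not results.empty():
                out.write(repr(results.get()) + '\n')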

I'm sure there are niggles that need to be worked out, but is there anything obviously wrong with this approach? At first glance, it seems robust, requiring only that the nodes have access to the filesystem(s) containing the .py file and the dependencies. Using __import__ has the advantage that the code in the module is automatically run, and so the function should execute in the correct context.

Any suggestions or criticism would be greatly appreciated.

EDIT: I should mention that I've got the code-execution bit working, but the node servers and the job server have yet to be written.

asked Nov 01 '10 by Chinmay Kanchi

2 Answers

I have actually written something that probably satisfies your needs: jug. If it does not solve your problems, I promise you I'll fix any bugs you find.

The architecture is slightly different: workers all run the same code, but they effectively generate a similar dictionary and ask the central backend "has this been run?". If not, they run it (there is a locking mechanism too). The backend can simply be the filesystem if you are on an NFS system.
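
A minimal jugfile, roughly following jug's TaskGenerator pattern (the process function here is just a stand-in, and details may differ between versions):

    # jugfile.py
    from jug import TaskGenerator

    @TaskGenerator
    def process(i):
        return i * i

    results = [process(i) for i in range(100)]

You then start `jug execute jugfile.py` on as many machines as you like; as long as they see the same filesystem (NFS is fine), each invocation locks and runs only the tasks that have not been run yet.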

answered Sep 21 '22 by luispedro


I myself have been tinkering with batch image manipulation across my computers, and my biggest problem was the fact that some things don't easily or natively pickle and transmit across the network.

For example, pygame's surfaces don't pickle; I have to convert them to strings by saving them into StringIO objects and then dumping those across the network.

If the data you are transmitting (e.g. your arguments) pickles cleanly, you should not have many problems sending it across the network.
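
As a rough sketch of the same idea, here is the conversion done with pygame's tostring/fromstring helpers instead of StringIO (the exact format handling may need tweaking for your surfaces):

    import pickle
    import pygame

    def surface_to_payload(surf):
        """Reduce a Surface to plain bytes plus a size tuple, which pickle fine."""
        return {
            'pixels': pygame.image.tostring(surf, 'RGBA'),
            'size': surf.get_size(),
        }

    def payload_to_surface(payload):
        """Rebuild the Surface on the receiving end."""
        return pygame.image.fromstring(payload['pixels'], payload['size'], 'RGBA')

    # the payload is just bytes and tuples, so this pickles without complaint
    wire_data = pickle.dumps(surface_to_payload(pygame.Surface((64, 64))))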

Another thing comes to mind: what do you plan to do if a computer suddenly "disappears" while doing a task, or while returning the data? Do you have a plan for re-sending tasks?
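
Just to make that concrete, one possible re-queue-on-failure loop (purely a sketch, kept sequential for clarity; send_job is a stand-in for however you actually hand a job to a node):

    import queue

    def run_with_retries(jobs, nodes, send_job, max_attempts=3):
        """Re-queue a job on another node if an attempt dies mid-flight."""
        pending = queue.Queue()
        for job in jobs:
            pending.put((job, 0))
        results = []
        while not pending.empty():
            job, attempts = pending.get()
            node = nodes[attempts % len(nodes)]  # try a different node each time
            try:
                results.append(send_job(node, job))
            except (ConnectionError, EOFError):
                if attempts + 1 >= max_attempts:
                    raise
                pending.put((job, attempts + 1))  # back in the queue for a retry
        return results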

answered Sep 19 '22 by admalledd