Comparison of the multiprocessing module and pyro?

I use Pyro for basic management of parallel jobs on a compute cluster. I just moved to a cluster where I will be responsible for using all the cores on each compute node. (On previous clusters, each core has been a separate node.) The Python multiprocessing module seems like a good fit for this. I notice it can also be used for remote-process communication. If anyone has used both frameworks for remote-process communication, I'd be grateful to hear how they stack up against each other. The obvious benefit of the multiprocessing module is that it's built in from 2.6. Apart from that, it's hard for me to tell which is better.
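For concreteness, the local, single-node part of what I mean looks something like this minimal sketch (the job function is a made-up stand-in for my real work):

from multiprocessing import Pool, cpu_count

def run_job(params):
    # stand-in for one unit of cluster work
    return sum(x * x for x in params)

if __name__ == "__main__":
    pool = Pool(processes=cpu_count())   # one worker per core on the node
    results = pool.map(run_job, [range(n) for n in (10, 20, 30)])
    print results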

asked Jul 23 '09 by Alex Coventry

1 Answer

EDIT: I'm changing my answer so you avoid pain. multiprocessing is immature, the docs on BaseManager are INCORRECT, and if you're an object-oriented thinker who wants to create shared objects on the fly at run-time, USE PYRO OR YOU WILL SERIOUSLY REGRET IT! If you're just doing functional programming using a shared queue that you register up front, like all the stupid examples (a sketch of that pattern follows the list below), GOOD FOR YOU.

Short Answer

Multiprocessing:

  • Feels awkward doing object-oriented remote objects
  • Easy breezy crypto (authkey)
  • Over a network or just inter-process communication
  • No extra nameserver hassle like in Pyro (there are ways to get around this)
  • Edit: Can't "register" objects once the manager is instantiated!!??
  • Edit: If a server isn't started, the client throws an "Invalid argument" exception instead of just saying "Failed to connect". WTF!?
  • Edit: The BaseManager documentation is incorrect! There is no "start" method!?!
  • Edit: Very few examples of how to use it.
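To be fair, the up-front registration pattern is trivial when it fits your problem. A minimal sketch of it (my names, not from any official example):

from multiprocessing.managers import BaseManager
from Queue import Queue

queue = Queue()

class QueueManager(BaseManager):
    pass

# Everything shared is registered BEFORE the server starts -- fine for
# functional pipelines, useless for creating shared objects at run-time.
QueueManager.register("getQueue", callable=lambda: queue)

if __name__ == "__main__":
    manager = QueueManager(address=('', 50001), authkey='hello')
    server = manager.get_server()
    server.serve_forever()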

Pyro:

  • Simple remote objects (see the sketch after this list)
  • Network comms only (loopback if local only)
  • Edit: This thing just WORKS, and it likes object-oriented object sharing, which makes me LIKE it
  • Edit: Why isn't THIS a part of the standard library instead of that multiprocessing piece of crap that tried to copy it and failed miserably?
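To illustrate "simple remote objects", here is roughly what the same idea looks like in Pyro. This is a Pyro 3.x-era sketch from memory, with made-up class and object names, so treat the details as approximate:

import Pyro.core

class TextureServer(Pyro.core.ObjBase):
    def __init__(self):
        Pyro.core.ObjBase.__init__(self)
        self.data = [0] * 100

    def setData(self, data):
        self.data = data

    def getData(self):
        return self.data

if __name__ == "__main__":
    # Publish one object under a name; no nameserver needed if the
    # client addresses the daemon directly with a PYROLOC:// URI.
    Pyro.core.initServer()
    daemon = Pyro.core.Daemon()
    daemon.connect(TextureServer(), "texture")
    daemon.requestLoop()

# Client (separate process):
#   import Pyro.core
#   t = Pyro.core.getProxyForURI("PYROLOC://localhost:7766/texture")
#   t.setData([2] * 100)    # runs on the server; state stays server-side
#   print t.getData()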

Edit: The first time I answered this I had just dived into 2.6 multiprocessing. In the code I show below, the Texture class is registered and shared as a proxy; however, the "data" attribute inside of it is NOT. So guess what happens: each process ends up with a separate copy of the "data" attribute inside the Texture proxy, despite what you might expect. I just spent an untold number of hours trying to figure out a good pattern for creating shared objects at run-time, and I kept running into brick walls. It has been quite confusing and frustrating. Maybe it's just me, but looking at the scant examples other people have attempted, it doesn't look like it.

I'm having to make the painful decision of dropping the multiprocessing library in favor of Pyro until multiprocessing is more mature. While I was initially excited to learn about multiprocessing being built into Python, I am now thoroughly disgusted with it, and would gladly install the Pyro package many, many times over, with glee that such a beautiful library exists for Python.

Long Answer

I have used Pyro in past projects and have been very happy with it. I have also started to work with multiprocessing, new in 2.6.

With multiprocessing I found it a bit awkward to allow shared objects to be created as needed. It seems that, in its youth, the multiprocessing module is geared more toward functional programming than object-oriented programming. That's not entirely fair, since it is possible; I just feel constrained by the "register" calls.

For example:

manager.py:

from multiprocessing.managers import BaseManager

class Texture(object):
    def __init__(self, data):
        self.data = data

    def setData(self, data):
        print "Calling set data %s" % (data)
        self.data = data

    def getData(self):
        return self.data

class TextureManager(BaseManager):
    def __init__(self, address=None, authkey=''):
        BaseManager.__init__(self, address, authkey)
        self.textures = {}

    def addTexture(self, name, texture):
        self.textures[name] = texture

    def hasTexture(self, name):
        return name in self.textures

server.py:

from multiprocessing.managers import BaseManager
from manager import Texture, TextureManager

manager = TextureManager(address=('', 50000), authkey='hello')

def getTexture(name):
    # Runs in the server process: create the texture on first request and
    # try to register it on the fly (the awkward part discussed below).
    if manager.hasTexture(name):
        return manager.textures[name]
    else:
        texture = Texture([0] * 100)
        manager.addTexture(name, texture)
        manager.register(name, lambda: texture)
        return texture

TextureManager.register("getTexture", getTexture)

if __name__ == "__main__":
    server = manager.get_server()
    server.serve_forever()

client.py:

from multiprocessing.managers import BaseManager
from manager import Texture, TextureManager

if __name__ == "__main__":
    # Register the typeid first so the manager grows a getTexture() proxy
    # method, then connect to the running server.
    TextureManager.register("getTexture")
    manager = TextureManager(address=('127.0.0.1', 50000), authkey='hello')
    manager.connect()
    texture = manager.getTexture("texture2")
    data = [2] * 100
    texture.setData(data)
    print "data = %s" % (texture.getData())

The awkwardness I'm describing comes from server.py, where I register a getTexture function to retrieve a texture of a certain name from the TextureManager. As I go over this, the awkwardness could probably be removed if I made the TextureManager itself a shareable object which creates and retrieves shareable textures on demand; a rough sketch of that idea follows. Meh, I'm still playing with it, but you get the idea. I don't remember encountering this awkwardness using Pyro, but there is probably a solution cleaner than the example above.
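For what it's worth, here is a hedged sketch of that cleaner shape. It leans on the method_to_typeid option of BaseManager.register so the store's getTexture returns a proxy rather than a pickled copy. The TextureStore class and all the names below are mine, and I haven't battle-tested this, so treat it as a direction rather than a fix:

from multiprocessing.managers import BaseManager
from manager import Texture

class TextureStore(object):
    # Lives in the server process and owns the real Texture objects.
    def __init__(self):
        self.textures = {}

    def getTexture(self, name):
        if name not in self.textures:
            self.textures[name] = Texture([0] * 100)
        return self.textures[name]

store = TextureStore()

class StoreManager(BaseManager):
    pass

# Register Texture as a proxyable type, then declare that the result of
# getStore().getTexture(...) should come back as a 'Texture' proxy.
StoreManager.register("Texture", Texture)
StoreManager.register("getStore", callable=lambda: store,
                      method_to_typeid={"getTexture": "Texture"})

if __name__ == "__main__":
    manager = StoreManager(address=('', 50000), authkey='hello')
    server = manager.get_server()
    server.serve_forever()

A client would then call store = manager.getStore() followed by texture = store.getTexture("texture2"), and texture.setData(...) would mutate the single server-side copy instead of a local one.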

answered Sep 22 '22 by manifest