
Python Multiprocessing Documentation Example

I'm trying to learn Python multiprocessing.

I'm working from the example at http://docs.python.org/2/library/multiprocessing.html that is introduced with "To show the individual process IDs involved, here is an expanded example:"

from multiprocessing import Process
import os

def info(title):
    print title
    print 'module name:', __name__
    if hasattr(os, 'getppid'):  # only available on Unix
        print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name):
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    info('main line')
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

What exactly am I looking at? I see that f(name) is called after info('main line') has finished, but that synchronous ordering would be the default anyway. I also see that the process that runs info('main line') is the parent (by PID) of the process that runs f(name), but I'm not sure what is 'multiprocessing' about that.

Also, the docs describe join() as "Block the calling thread until the process whose join() method is called terminates". I'm not clear on what the calling thread would be. In this example, what would join() be blocking?

dman asked Aug 11 '13 05:08


1 Answer

How multiprocessing works, in a nutshell:

  • p.start() spawns (fork or similar on Unix-like systems) a copy of the original program (on Windows, which lacks a real fork, this is tricky and requires the special care that the module documentation notes). Constructing Process(...) only describes the work; nothing runs until start() is called.
  • The copy communicates with the original to figure out that (a) it's a copy and (b) it should go off and invoke the target= function (see below).
  • At this point, the original and copy are now different and independent, and can run simultaneously.
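
One way to see that the copy really is independent is to have the child change some module-level state and then check it in the parent. This is only a minimal sketch in Python 3 syntax (the question's example is Python 2); the names counter and bump are made up for illustration.

```python
from multiprocessing import Process
import os

counter = 0  # module-level state; every process ends up with its own copy

def bump():
    """Increment this process's copy of counter and report it."""
    global counter
    counter += 1
    print('pid', os.getpid(), 'sees counter =', counter)
    return counter

if __name__ == '__main__':
    p = Process(target=bump)
    p.start()   # the child runs bump() on its own copy of counter
    p.join()
    # The child's increment is invisible here: the parent's copy is untouched.
    print('parent pid', os.getpid(), 'still sees counter =', counter)
```

The two PIDs printed should differ, and the parent should still report 0 even though the child reported 1.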

Since these are independent processes, they now have independent Global Interpreter Locks (in CPython), so both can use up to 100% of a CPU on a multi-CPU box, as long as they don't contend for other lower-level (OS) resources. That's the "multiprocessing" part.
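
A rough way to observe this is to time a CPU-bound loop run twice in a row against the same loop run in two processes at once. The sketch below assumes Python 3 and a machine with at least two idle cores; count_down and the loop size are arbitrary choices of mine, and the actual timings depend on the hardware and on process start-up cost.

```python
from multiprocessing import Process
import time

def count_down(n):
    """Pure-Python busy loop: CPU-bound, no I/O."""
    while n > 0:
        n -= 1
    return n

if __name__ == '__main__':
    n = 2_000_000  # arbitrary amount of work per call

    # Two calls, one after the other, in a single process.
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    serial = time.perf_counter() - start

    # The same two calls, each in its own process (and so under its own GIL).
    start = time.perf_counter()
    workers = [Process(target=count_down, args=(n,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    parallel = time.perf_counter() - start

    print('one process:   %.2f s' % serial)
    print('two processes: %.2f s' % parallel)
```

With enough work per call, the two-process run would be expected to take noticeably less wall-clock time than the serial one; with tiny workloads the start-up overhead dominates instead.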

Of course, at some point you have to send data back and forth between these supposedly-independent processes, e.g., to send results from one (or many) worker process(es) back to a "main" process. (There is the occasional exception where everyone's completely independent, but it's rare ... plus there's the whole start-up sequence itself, kicked off by p.start().) So the module provides communications channels between a parent and the processes it creates. A bare Process instance (p, in the above example) receives its arguments at start-up but has no built-in way to hand back a result, so you add a Queue or Pipe yourself (Pool sets these up for you). The multiprocessing module uses the pickle module to turn data into byte strings (the same ones you can stash in files with pickle.dump) and sends the data across the channel, "downwards" to workers to send arguments and such, and "upwards" from workers to send back results.
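
Here is a small sketch of sending a result "upwards" with a multiprocessing.Queue, in Python 3 syntax. The helper names square_all and roundtrip are mine, not part of the module; roundtrip just shows by hand the pickling step that the queue performs on each item.

```python
from multiprocessing import Process, Queue
import pickle

def square_all(numbers, results):
    """Worker: compute something and put it on the queue for the parent."""
    results.put([n * n for n in numbers])

def roundtrip(obj):
    """Pickle and unpickle obj, as happens to anything sent through a Queue."""
    return pickle.loads(pickle.dumps(obj))

if __name__ == '__main__':
    results = Queue()
    p = Process(target=square_all, args=([1, 2, 3], results))
    p.start()
    print(results.get())  # blocks until the worker has put its result: [1, 4, 9]
    p.join()
```

Note that the result is fetched before join(); the documentation warns that joining a process that still has items buffered in a queue can deadlock.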

Eventually, once you're all done with getting results, the worker finishes (by returning from the target= function) and tells the parent it's done. To make sure everything gets closed and cleaned up, the parent should call p.join(), which blocks the caller (here, the parent) until the worker's "I'm done" message arrives (actually an OS-level exit on Unix-ish systems).
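
To see what join() actually waits for, you can look at the worker's state on either side of the call. A minimal sketch in Python 3 syntax, where nap is a made-up target that just sleeps:

```python
from multiprocessing import Process
import time

def nap(seconds):
    """Stand-in for real work: just sleep for a while."""
    time.sleep(seconds)

if __name__ == '__main__':
    p = Process(target=nap, args=(0.5,))
    p.start()
    # The worker is still running, so it has no exit code yet.
    print('before join: alive =', p.is_alive(), 'exitcode =', p.exitcode)
    p.join()  # the parent waits here until the worker exits
    print('after join:  alive =', p.is_alive(), 'exitcode =', p.exitcode)
```

Before the join, the worker should be reported as alive with an exitcode of None; afterwards it should be not alive with exitcode 0, which is what a normal return from the target produces.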

The example is a little bit silly since the two printed messages take basically no time at all, so running them "at the same time" has no measurable gain. But suppose instead of just printing hello, f were to calculate the first 100,000 digits of π (3.14159...). You could then spawn another Process, p2, with a different target g that calculates the first 100,000 digits of e (2.71828...). These would run independently. The parent could then call p.join() and p2.join() to wait for both to complete (or spawn yet more workers to do more work and occupy more CPUs, or even go off and do its own work for a while first).
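
A scaled-down sketch of that idea, in Python 3 syntax: instead of 100,000 digits, f and g compute ordinary floating-point approximations of π and e with simple series (approx_pi and approx_e are stand-ins I made up, not serious digit calculators), and each sends its answer back through a Queue.

```python
from multiprocessing import Process, Queue

def approx_pi(terms):
    """Leibniz series for pi: slow to converge, used here only to burn CPU."""
    return 4 * sum((-1) ** k / (2 * k + 1) for k in range(terms))

def approx_e(terms):
    """Sum of 1/k! for k = 0 .. terms - 1."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term /= (k + 1)
    return total

def f(results):
    results.put(('pi', approx_pi(200000)))

def g(results):
    results.put(('e', approx_e(20)))

if __name__ == '__main__':
    results = Queue()
    p = Process(target=f, args=(results,))
    p2 = Process(target=g, args=(results,))
    p.start()
    p2.start()  # both workers are now running independently
    # Collect one result per worker; they may arrive in either order.
    answers = dict(results.get() for _ in range(2))
    p.join()
    p2.join()
    print(answers)
```

Tagging each result with a name lets the parent tell the answers apart no matter which worker finishes first.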

torek answered Sep 30 '22 18:09