I'm trying to learn Python multiprocessing.
I'm working from http://docs.python.org/2/library/multiprocessing.html, specifically the example under "To show the individual process IDs involved, here is an expanded example:"
from multiprocessing import Process
import os

def info(title):
    print title
    print 'module name:', __name__
    if hasattr(os, 'getppid'):  # only available on Unix
        print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name):
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    info('main line')
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
What exactly am I looking at? I see that f(name) is called after info('main line') finishes, but sequential execution would be the default anyway. I also see that the process that runs info('main line') is the parent (by PID) of the process that runs f, but I'm not sure what is 'multiprocessing' about that.
Also, for join(), the docs say "Block the calling thread until the process whose join() method is called terminates". I'm not clear on what the calling thread would be here. In this example, what would join() be blocking?
How multiprocessing works, in a nutshell:
Process() spawns (fork or similar on Unix-like systems) a copy of the original program. (On Windows, which lacks a real fork, this is tricky and requires the special care that the module documentation notes.) The copy then goes off and runs the target= function (see below). Since these are independent processes, they now have independent Global Interpreter Locks (in CPython), so both can use up to 100% of a CPU on a multi-CPU box, as long as they don't contend for other lower-level (OS) resources. That's the "multiprocessing" part.
Of course, at some point you have to send data back and forth between these supposedly-independent processes, e.g., to send results from one (or many) worker process(es) back to a "main" process. (There is the occasional exception where everyone is completely independent, but it's rare... plus there's the whole start-up sequence itself, kicked off by p.start().) So each created Process instance (p, in the example above) has a communications channel to its parent creator, and vice versa (it's a symmetric connection). The multiprocessing module uses the pickle module to turn data into strings (the same strings you can stash in files with pickle.dump) and sends the data across the channel: "downwards" to workers, to send arguments and such, and "upwards" from workers, to send back results.
Eventually, once you're all done with getting results, the worker finishes (by returning from the target= function) and tells the parent it's done. To make sure everything gets closed and cleaned up, the parent should call p.join() to wait for the worker's "I'm done" message (actually an OS-level exit on Unix-ish systems).
The example is a little bit silly, since the two printed messages take basically no time at all, so running them "at the same time" has no measurable gain. But suppose that instead of just printing hello, f were to calculate the first 100,000 digits of π (3.14159...). You could then spawn another Process, p2, with a different target g that calculates the first 100,000 digits of e (2.71828...). These would run independently. The parent could then call p.join() and p2.join() to wait for both to complete (or spawn yet more workers to do more work and occupy more CPUs, or even go off and do its own work for a while first).