Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does python multiprocessing pickle objects to pass objects between processes?

Why does the multiprocessing package for python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string.

like image 615
Michael Avatar asked Jul 03 '14 21:07

Michael


1 Answers

multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.

To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.

When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).

http://en.wikipedia.org/wiki/Shared_memory

http://en.wikipedia.org/wiki/Serialization

Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets" This should be quite fast.

http://en.wikipedia.org/wiki/Unix_domain_socket

Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.

One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.

EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory

like image 115
steveha Avatar answered Oct 16 '22 11:10

steveha