Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determining exactly what is pickled during Python multiprocessing

Tags:

As explained in the thread What is being pickled when I call multiprocessing.Process? there are circumstances where multiprocessing requires little to no data to be transferred via pickling. For example, on Unix systems, the interpreter uses fork() to create the processes, and objects which already exist when multiprocessing starts can be accessed by each process without pickling.

However, I'm trying to consider scenarios beyond "here's how it's supposed to work". For example, the code may have a bug and an object which is supposed to read-only is inadvertently modified, leading to its pickling to be transferred to other processes.

Is there some way to determine what, or at least how much, is pickled during multiprocessing? The information doesn't necessarily have to be in real-time, but it would be helpful if there was a way to get some statistics (e.g., number of objects pickled) which might give a hint as to why something is taking longer to run than intended because of unexpected pickling overhead.

I'm looking for a solution internal to the Python environment. Process tracing (e.g. Linux strace), network snooping, generalized IPC statistics, and similar solutions that might be used to count the number of bytes moving between processes aren't going to be specific enough to identify object pickling versus other types of communication.


Updated: Disappointingly, there appears to be no way to gather pickling statistics short of hacking up the module and/or interpreter sources. However, @aaron does explain this and clarified a few minor points, so I have accepted the answer.

like image 959
sj95126 Avatar asked Aug 16 '21 17:08

sj95126


People also ask

Does multiprocessing use pickle?

However, the multiprocess tasks can't be pickled; it would raise an error failing to pickle. That's because when dividing a single task over multiprocess, these might need to share data; however, it doesn't share memory space.

What is pickled data in Python?

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.

How do I check if a process is running in Python multiprocessing?

We can check if a process is alive via the multiprocessing. Process. is_alive() method.

What can you tell about multiprocessing in Python?

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.


1 Answers

Multiprocessing isn't exactly a simple library, but once you're familiar with how it works, it's pretty easy to poke around and figure it out.

You usually want to start with context.py. This is where all the useful classes get bound depending on OS, and... well... the "context" you have active. There are 4 basic contexts: Fork, ForkServer, and Spawn for posix; and a separate Spawn for windows. These in turn each have their own "Popen" (called at start()) to launch a new process to handle the separate implementations.

popen_fork.py

creating a process literally calls os.fork(), and then in the child organizes to run BaseProcess._bootstrap() which sets up some cleanup stuff then calls self.run() to execute the code you give it. No pickling occurs to start a process this way because the entire memory space gets copied (with some exceptions. see: fork(2)).

popen_spawn_xxxxx.py

I am most familiar with windows, but I assume both the win32 and posix versions operate in a very similar manner. A new python process is created with a simple crafted command line string including a pair of pipe handles to read/write from/to. The new process will import the __main__ module (generally equal to sys.argv[0]) in order to have access to all the needed references. Then it will execute a simple bootstrap function (from the command string) that attempts to read and un-pickle a Process object from its pipe it was created with. Once it has the Process instance (a new object which is a copy; not just a reference to the original), it will again arrange to call _bootstrap().

popen_forkserver.py

The first time a new process is created with the "forkserver" context, a new process will be "spawn"ed running a simple server (listening on a pipe) which handles new process requests. Subsequent process requests all go to the same server (based on import mechanics and a module-level global for the server instance). New processes are then "fork"ed from that server in order to save the time of spinning up a new python instance. These new processes however can't have any of the same (as in same object and not a copy) Process objects because the python process they were forked from was itself "spawn"ed. Therefore the Process instance is pickled and sent much like with "spawn". The benefits of this method include: The process doing the forking is single threaded to avoid deadlocks. The cost of spinning up a new python interpreter is only paid once. The memory consumption of the interpreter, and any modules imported by __main__ can largely be shared due to "fork" generally using copy-on-write memory pages.


In all cases, once the split has occurred, you should consider the memory spaces totally separate, and the only communication between them is via pipes or shared memory. Locks and Semaphores are handled by an extension library (written in c), but are basically named semaphores managed by the OS. Queue's, Pipe's and multiprocessing.Manager's use pickling to synchronize changes to the proxy objects they return. The new-ish multiprocessing.shared_memory uses a memory-mapped file or buffer to share data (managed by the OS like semaphores).

To address your concern:

the code may have a bug and an object which is supposed to read-only is inadvertently modified, leading to its pickling to be transferred to other processes.

This only really applies to multiprocessing.Manager proxy objects. As everything else requires you to be very intentional about sending and receiveing data, or instead uses some other transfer mechanism than pickling.

like image 153
Aaron Avatar answered Sep 18 '22 21:09

Aaron