
Why is pickle needed for the multiprocessing module in Python?

I was doing multiprocessing in Python and hit a pickling error, which makes me wonder: why do we need to pickle the object in order to do multiprocessing? Isn't fork() enough?

Edit: I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer, right? Why does the multiprocessing module also try to pickle stuff like functions etc.?

asked Oct 01 '18 23:10 by Gaaaaaaaa



1 Answer

Which makes me wonder why do we need to pickle the object in order to do multiprocessing?

We don't strictly need pickle, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes; memory sharing is the other. Unlike memory sharing, serialization doesn't even require the processes to be on the same machine. For example, PySpark uses serialization very heavily to communicate between executors (which typically run on different machines).
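To make "serialization" concrete, here is a minimal sketch of the round trip that happens to any object sent between processes. The `payload` dictionary is a made-up example, not something from the question.

```python
import pickle

# Hypothetical payload; any picklable Python object would do.
payload = {"ids": [1, 2, 3], "label": "batch-7"}

# Sender side: flatten the object graph into a byte string.
wire_bytes = pickle.dumps(payload)

# Receiver side: rebuild an equivalent object from those bytes.
received = pickle.loads(wire_bytes)

print(type(wire_bytes).__name__)      # the wire format is plain bytes
print(received == payload)            # same contents...
print(received is payload)            # ...but a separate copy
```

The receiver ends up with an equal but independent copy, which is why mutating it never affects the sender's object.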

Addendum: There are also issues with the GIL (Global Interpreter Lock) when sharing memory in Python.

isn't fork() enough?

Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
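A small sketch of that point, assuming a POSIX system where the "fork" start method is available (it is not on Windows). The names `counter`, `bump_local`, `bump_and_report`, and `run_demo` are invented for illustration.

```python
import multiprocessing as mp

# Module-level state that every forked child gets its own copy of.
counter = {"value": 0}


def bump_local():
    # Runs in the child: modifies only the child's copy of `counter`.
    counter["value"] += 1


def bump_and_report(queue):
    # Runs in the child: modifies its copy, then explicitly sends the
    # result back. The queue pickles the value to move it across.
    counter["value"] += 1
    queue.put(counter["value"])


def run_demo():
    ctx = mp.get_context("fork")  # assumption: POSIX only

    child = ctx.Process(target=bump_local)
    child.start()
    child.join()
    seen_by_parent = counter["value"]  # the child's change is not visible

    queue = ctx.Queue()
    child = ctx.Process(target=bump_and_report, args=(queue,))
    child.start()
    reported = queue.get()  # read before join so the child can flush
    child.join()
    return seen_by_parent, reported


if __name__ == "__main__":
    print(run_demo())
```

The first child's increment is lost to the parent; only the second child, which sends its value through a queue, gets anything back across the process boundary.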

I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer right? why does the multiprocessing module also try to pickle stuff like functions etc?

  1. Sometimes complex objects (i.e. "other stuff"? not totally clear on what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff".

  2. Being able to send a function to another process is incredibly useful. You can create a bunch of child processes and then send them all a function to execute concurrently that you define later in your program. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).

  3. Some functional purists (mostly the Lisp people) argue that code and data are the same thing, so for some there isn't much of a line to draw.
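On the original pickling error: pickle serializes a function as a reference (module name plus qualified name), not as its code, so the receiving process has to be able to look it up by name. That is a common reason lambdas and locally defined functions fail. A rough sketch, where `square` and `can_pickle` are illustrative helpers and not part of any library:

```python
import math
import pickle


def square(x):
    # Module-level function: reachable by name, so normally picklable.
    return x * x


def can_pickle(obj):
    # Report whether pickle.dumps accepts the object. Functions that
    # cannot be found by name raise PicklingError (or AttributeError
    # on some Python versions).
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        return False
    return True


if __name__ == "__main__":
    print(can_pickle(square))
    print(can_pickle(lambda x: x * x))
    # The pickle of a function is only a short name reference.
    print(pickle.dumps(math.floor))
```

If your multiprocessing call fails with a pickling error, moving the target function to module level (rather than using a lambda or nested function) is usually the first thing to try.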

answered Sep 28 '22 06:09 by Matt Messersmith