Why is pickle needed for the multiprocessing module in Python?

Tags: python, multiprocessing, pickle

I was doing multiprocessing in Python and hit a pickling error. Which makes me wonder why do we need to pickle the object in order to do multiprocessing? Isn't fork() enough?

Edit: I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer, right? Why does the multiprocessing module also try to pickle stuff like functions etc.?

705

asked Oct 01 '18 23:10

Gaaaaaaaa

1 Answer

Which makes me wonder why do we need to pickle the object in order to do multiprocessing?

We don't strictly need pickle, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes; memory sharing is the other. Unlike memory sharing, serialization doesn't even require the processes to be on the same machine. For example, PySpark uses serialization very heavily to communicate between executors (which are typically different machines).
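As a minimal sketch of what "serialization" means here (the `record` dictionary is just a made-up example value): pickle turns an object into a byte string, and those bytes can be written to a pipe or socket and rebuilt into an equal, but separate, object on the other side.

```python
import pickle

# A hypothetical piece of data we want to hand to another process.
record = {"id": 7, "scores": [0.5, 0.25], "tags": ("a", "b")}

# Serialize: the object becomes plain bytes that can cross a pipe or socket.
payload = pickle.dumps(record)

# Deserialize: the receiver rebuilds an equal object from those bytes.
restored = pickle.loads(payload)

# The copy compares equal to the original but is a distinct object.
print(type(payload).__name__, restored == record, restored is record)
# prints: bytes True False
```

The receiver never touches the sender's memory; it only sees the bytes, which is why this approach also works across machines.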

Addendum: There are also issues with the GIL (Global Interpreter Lock) when sharing memory in Python (see comments below for detail).

isn't fork() enough?

Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
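A rough illustration of that point, assuming a Unix-like system where `os.fork()` is available (the `fork_demo` function and `counter` dictionary are hypothetical names for this sketch). The child changes its copy of the data, and the parent only learns about the change because the child explicitly pickles the value and writes it to a pipe.

```python
import os
import pickle


def fork_demo():
    """Fork a child, let it mutate its copy of `counter`, and report both views."""
    counter = {"value": 0}
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child process: it works on its own copy of the parent's memory.
        try:
            os.close(read_fd)
            counter["value"] += 1
            # Explicitly send the new value back as pickled bytes.
            with os.fdopen(write_fd, "wb") as writer:
                writer.write(pickle.dumps(counter))
        finally:
            os._exit(0)  # leave the child without running parent cleanup code

    # Parent process: close our write end so read() sees EOF when the child is done.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_counter = pickle.loads(reader.read())
    os.waitpid(pid, 0)

    # The parent's own copy was never touched by the child's increment.
    return counter, child_counter


parent_view, child_view = fork_demo()
print(parent_view, child_view)
```

The parent's dictionary still holds 0 after the fork, while the value received through the pipe holds 1. This is roughly the kind of plumbing that multiprocessing does for you with its queues and pipes.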

I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer, right? Why does the multiprocessing module also try to pickle stuff like functions etc.?

Sometimes complex objects (i.e. "other stuff"? I'm not totally clear on what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff" too.
Being able to send a function to another process is incredibly useful. You can create a bunch of child processes up front and later send them all a function, defined further down in your program, to execute concurrently. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).
Some functional purists (mostly the LISP people) argue that code and data are the same thing, so for them there isn't much of a line to draw here.
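One detail that often explains pickling errors like the one in the question: pickle stores a function by reference (its module and qualified name), not by value, so the receiving process must be able to look that name up. A small sketch of this, using `math.sqrt` as a stand-in for an importable function and a hypothetical helper `can_pickle`:

```python
import math
import pickle


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# An importable function is pickled as a reference: module name plus function name.
payload = pickle.dumps(math.sqrt)
restored = pickle.loads(payload)

# Unpickling looks the name up again, so we get the very same function object back.
print(restored is math.sqrt, b"math" in payload, b"sqrt" in payload)

# A lambda has no importable name, so pickle refuses it.
print(can_pickle(lambda x: x * x))
```

This is why a module-level function can usually be handed to a worker process, while a lambda or a function defined inside another function typically triggers a pickling error.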

177

answered Sep 28 '22 06:09

Matt Messersmith
