multiprocessing in python - what gets inherited by forkserver process from parent process?

Tags:

python

global

multiprocessing

multiprocess

I am trying to use forkserver and I encountered NameError: name 'xxx' is not defined in worker processes.

I am using Python 3.6.4, but the documentation should be the same. The section at https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods says:

The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.

Also, it says:

Better to inherit than pickle/unpickle

When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.

So apparently a key object that my worker processes need was not inherited by the server process and then passed on to the workers. Why did that happen? What exactly gets inherited by the forkserver process from the parent process?

Here is what my code looks like:

import multiprocessing
# import (a bunch of other modules)

def worker_func(nameList):
    global largeObject
    for item in nameList:
        # get some info from largeObject using item as index
        # do some calculation
        return [item, info]

if __name__ == '__main__':
    result = []
    largeObject # This is my large object, it's read-only and no modification will be made to it.
    nameList # Here is a list variable that I will need to get info for each item in it from the largeObject    
    ctx_in_main = multiprocessing.get_context('forkserver')
    print('Start parallel, using forking/spawning/?:', ctx_in_main.get_start_method())
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=4) as pool:
        for x in pool.imap_unordered(worker_func, nameList):
            result.append(x)

Thank you!

Best,

484

asked Aug 15 '20 08:08

sgyzetrov

2 Answers

Theory

Below is an excerpt from Bojan Nikolic's blog:

Modern Python versions (on Linux) provide three ways of starting the separate processes:

1. Fork()-ing the parent processes and continuing with the same processes image in both parent and child. This method is fast, but potentially unreliable when parent state is complex

2. Spawning the child processes, i.e., fork()-ing and then execv to replace the process image with a new Python process. This method is reliable but slow, as the processes image is reloaded afresh.

3. The forkserver mechanism, which consists of a separate Python server with that has a relatively simple state and which is fork()-ed when a new processes is needed. This method combines the speed of Fork()-ing with good reliability (because the parent being forked is in a simple state).

Forkserver

The third method, forkserver, is illustrated below. Note that children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust this through the multiprocess API through the set_forkserver_preload() method.

Practice

Thus, if you want something to be inherited by the child processes, it must be made part of the forkserver's state by means of set_forkserver_preload(module_names), which sets the list of module names that the forkserver process will try to import. I give an example below:

# inherited.py
large_obj = {"one": 1, "two": 2, "three": 3}

# main.py
import multiprocessing
import os
from time import sleep

from inherited import large_obj


def worker_func(key: str):
    print(f"PID={os.getpid()}, obj id={id(large_obj)}")
    sleep(1)
    return large_obj[key]


if __name__ == '__main__':
    result = []
    ctx_in_main = multiprocessing.get_context('forkserver')
    ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=cores) as pool:
        for x in pool.imap(worker_func, ["one", "two", "three"]):
            result.append(x)
    for res in result:
        print(res)

Output:

# The PIDs are different but the address is always the same
PID=18603, obj id=139913466185024
PID=18604, obj id=139913466185024
PID=18605, obj id=139913466185024

And if we don't use preloading:

...
    ctx_in_main = multiprocessing.get_context('forkserver')
    # ctx_in_main.set_forkserver_preload(['inherited']) 
    cores = ctx_in_main.cpu_count()
...

# The PIDs are different, the addresses are different too
# (but sometimes they can coincide)
PID=19046, obj id=140011789067776
PID=19047, obj id=140011789030976
PID=19048, obj id=140011789030912

100

answered Sep 25 '22 02:09

alex_noname

After an inspiring discussion with Alex, I think I have enough information to answer my own question: what exactly gets inherited by the forkserver process from the parent process?

Basically, when the server process starts, it will import your main module, and everything outside the if __name__ == '__main__' block will be executed. That's why my code doesn't work: large_object is only created inside that block, so it is nowhere to be found in the server process or in any of the worker processes forked from it.
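To see why the guarded block is skipped, here is a small sketch that runs the same file twice with runpy: once as __main__ and once under the name __mp_main__, which is, as far as I know, the name multiprocessing uses when it re-runs the main script in a child process. The script contents and the helper names_after_run are made up for illustration.

```python
import os
import runpy
import tempfile
import textwrap

# A stand-in for the main script: one module-level name, one guarded name.
SCRIPT = textwrap.dedent('''
    module_level = "defined at import time"

    if __name__ == "__main__":
        large_object = {"one": 1}
''')


def names_after_run(run_name):
    """Execute SCRIPT with the given __name__ and return the globals it defines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "main_script.py")
        with open(path, "w") as fh:
            fh.write(SCRIPT)
        return runpy.run_path(path, run_name=run_name)


as_script = names_after_run("__main__")    # how the parent runs the file
as_child = names_after_run("__mp_main__")  # assumed name used for the re-import

print("large_object" in as_script)  # True
print("large_object" in as_child)   # False
print("module_level" in as_child)   # True
```

Only names created outside the guard survive the re-import, which matches the NameError in the workers.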

Alex's solution works because large_object now gets imported in both the main and the server process, so every worker forked from the server also gets large_object. Combined with set_forkserver_preload(module_names), all workers even seem to get the same large_object (same id), from what I saw. The reason for using forkserver is explained explicitly in the Python documentation and in Bojan's blog:

When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.

The forkserver mechanism, which consists of a separate Python server with that has a relatively simple state and which is fork()-ed when a new processes is needed. This method combines the speed of Fork()-ing with good reliability (because the parent being forked is in a simple state).

So the concern here is mostly safety.

On a side note, if you use fork as the start method, you don't need to import anything, since every child process gets a copy of the parent process's memory (or shares the parent's pages until one side writes to them, if the system uses copy-on-write; please correct me if I am wrong). In this case, using global large_object gives you access to large_object in worker_func directly.
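A minimal sketch of the fork case, assuming a POSIX system where the fork start method is available. The dictionary is a tiny stand-in for large_object, and child and lookup_in_forked_child are hypothetical helpers, not code from my real program.

```python
import multiprocessing


def child(key, conn):
    # Nothing is imported or passed in for large_object: the forked child
    # starts with a copy of the parent's memory, so the global already exists.
    conn.send(large_object[key])
    conn.close()


def lookup_in_forked_child(key):
    global large_object
    # Created at run time in the parent, as it would be under the main guard.
    large_object = {"one": 1, "two": 2, "three": 3}

    ctx = multiprocessing.get_context("fork")  # not available on Windows
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=child, args=(key, send_end))
    proc.start()
    send_end.close()  # the parent only reads from the pipe
    value = recv_end.recv()
    proc.join()
    return value


if __name__ == "__main__":
    print(lookup_in_forked_child("two"))  # 2
```

The same code with get_context("forkserver") would fail in the child, because large_object is never created in the server process.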

forkserver might not be a suitable approach for me because the issue I am facing is memory overhead. All the operations that produce large_object in the first place are memory-consuming, so I don't want any unnecessary resources in my worker processes.

If I put all those calculations directly into inherited.py as Alex suggested, they will be executed twice (once when the module is imported in main and once when the server imports it; maybe even more often when worker processes are born?). That is fine if I just want a safe, single-threaded process that workers can fork from. But since I am trying to get workers to inherit only large_object and no unnecessary resources, this won't work. Putting those calculations under if __name__ == '__main__' in inherited.py won't work either, since then none of the processes will execute them, including main and server.

So, in conclusion, if the goal is for workers to inherit minimal resources, I am better off splitting my code in two: run calculation.py first, pickle large_object, exit the interpreter, and start a fresh one that loads the pickled large_object. Then I can just go nuts with either fork or forkserver.
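The two-stage idea could look roughly like this. The file name large_object.pkl, the stand-in dictionary and the two helper functions are assumptions for illustration; in practice the save and load steps would live in two separate scripts run by two separate interpreters.

```python
import os
import pickle
import tempfile


def save_large_object(obj, path):
    """Stage 1 (calculation.py): write the finished object to disk."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_large_object(path):
    """Stage 2 (fresh interpreter): read the object back and nothing else."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Both stages run in one process here only to keep the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "large_object.pkl")
    save_large_object({"one": 1, "two": 2, "three": 3}, pkl_path)
    restored = load_large_object(pkl_path)

print(restored["two"])  # 2
```

The fresh interpreter then only pays for unpickling, not for the memory-hungry calculations that built the object.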

answered Sep 26 '22 02:09

sgyzetrov
