How can I restrict the scope of a multiprocessing process?

Tags:

multiprocessing

Using Python's multiprocessing module, the following contrived example runs with minimal memory requirements:

import multiprocessing 
# completely_unrelated_array = range(2**25)

def foo(x):
    for x in xrange(2**28):pass
    print x**2

P = multiprocessing.Pool()

for x in range(8):
    multiprocessing.Process(target=foo, args=(x,)).start()

Uncomment the creation of completely_unrelated_array and you'll find that each spawned process allocates memory for its own copy of it! This is a minimal example of a much larger project, and I can't figure out how to work around the problem; multiprocessing seems to make a copy of everything that is global. I don't need a shared memory object. I simply need to pass in x and process it without the memory overhead of the entire program.

Side observation: interestingly, print id(completely_unrelated_array) inside foo gives the same value, suggesting that somehow these might not be copies...


asked Aug 25 '14 00:08

Hooked

1 Answer

Because of the nature of os.fork(), any variables in the global namespace of your __main__ module will be inherited by the child processes (assuming you're on a POSIX platform), so you'll see the memory usage in the children reflect that as soon as they're created. I'm not sure all of that memory is really being allocated, though; as far as I know, it is shared with the parent until you actually try to change it in the child, at which point a new copy is made.

Windows, on the other hand, doesn't use os.fork(). It re-imports the main module in each child and pickles any objects you explicitly send to the children. So on Windows you can keep the large global from being copied into the children by defining it only inside an if __name__ == "__main__": guard, because everything inside that guard runs only in the parent process:

import time
import multiprocessing 


def foo(x):
    for x in range(2**28):pass
    print(x**2)

if __name__ == "__main__":
    completely_unrelated_array = list(range(2**25)) # This will only be defined in the parent on Windows
    P = multiprocessing.Pool()

    for x in range(8):
        multiprocessing.Process(target=foo, args=(x,)).start()
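The copy-on-write behaviour of os.fork() described above can be checked with a minimal sketch. It assumes a POSIX platform and CPython (where id() is a memory address); big_global and compare_parent_and_child are hypothetical names used only for illustration. The child modifies an inherited global and reports its id() and length back through a pipe:

```python
import os

# Hypothetical stand-in for the large global from the question.
big_global = list(range(1000))


def compare_parent_and_child():
    """Fork once and compare the child's view of big_global with the parent's."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: same virtual address space layout as the parent at fork time.
        try:
            os.close(read_fd)
            big_global.append(-1)  # this change stays private to the child
            msg = "%d %d" % (id(big_global), len(big_global))
            os.write(write_fd, msg.encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: read the child's report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_id, child_len = map(int, reader.read().split())
    os.waitpid(pid, 0)
    return child_id == id(big_global), child_len, len(big_global)


if __name__ == "__main__":
    print(compare_parent_and_child())
```

The matching id() is consistent with the question's side observation: the child sees the object at the same virtual address, yet its modification never reaches the parent. Note that in CPython even reading an object updates its reference count, so pages can end up being copied without any explicit write.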

Now, in Python 2.x on a POSIX platform, new multiprocessing.Process objects can only be created by forking. But in Python 3.4+, you can specify how new processes are created by using contexts. So we can request the "spawn" context, which is the one Windows uses, to create our new processes, and use the same trick:

# Note that this is Python 3.4+ only
import time
import multiprocessing 

def foo(x):
    for x in range(2**28):pass
    print(x**2)


if __name__ == "__main__":
    completely_unrelated_array = list(range(2**23))  # Again, this only exists in the parent
    ctx = multiprocessing.get_context("spawn") # Use process spawning instead of fork
    P = ctx.Pool()

    for x in range(8):
        ctx.Process(target=foo, args=(x,)).start()
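Which start methods exist depends on the platform, so it can help to check before asking for a context. The following is a small sketch under that assumption; pick_context is a hypothetical helper, not part of multiprocessing, and it only uses get_all_start_methods() and get_context() from Python 3.4+:

```python
import multiprocessing


def pick_context(preferred="spawn"):
    """Return a context for `preferred` if this platform supports it,
    otherwise fall back to the platform's default context."""
    available = multiprocessing.get_all_start_methods()
    if preferred in available:
        return multiprocessing.get_context(preferred)
    return multiprocessing.get_context()


if __name__ == "__main__":
    # Lists the start methods this platform supports.
    print(multiprocessing.get_all_start_methods())
    ctx = pick_context("spawn")
    print(ctx.get_start_method())
```

The returned context exposes the same Process and Pool API used above, so ctx.Process(...) and ctx.Pool() work unchanged.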

If you need 2.x support, or want to stick with using os.fork() to create new Process objects, I think the best you can do to get the reported memory usage down is to immediately delete the offending object in the child:

import time
import multiprocessing 
import gc

def foo(x):
    init()
    for x in range(2**28):pass
    print(x**2)

def init():
    global completely_unrelated_array
    completely_unrelated_array = None
    del completely_unrelated_array
    gc.collect()

if __name__ == "__main__":
    completely_unrelated_array = list(range(2**23))
    P = multiprocessing.Pool(initializer=init)

    for x in range(8):
        multiprocessing.Process(target=foo, args=(x,)).start()
    time.sleep(100)
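To confirm that the cleanup above really removes the name in a forked child while leaving the parent untouched, here is a rough sketch. It assumes a POSIX platform and Python 3.4+ (for the "fork" context); child and check_cleanup are hypothetical helpers, and the array is kept small so the check runs quickly:

```python
import gc
import multiprocessing

# Small stand-in for the large global; the real one would be far bigger.
completely_unrelated_array = list(range(2**10))


def init():
    # Same cleanup as in the answer: drop the inherited global in the child.
    global completely_unrelated_array
    completely_unrelated_array = None
    del completely_unrelated_array
    gc.collect()


def child(conn):
    init()
    # Report whether the name still exists in the child's module globals.
    conn.send("completely_unrelated_array" in globals())
    conn.close()


def check_cleanup():
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=child, args=(child_conn,))
    proc.start()
    present_in_child = parent_conn.recv()
    proc.join()
    return present_in_child, "completely_unrelated_array" in globals()


if __name__ == "__main__":
    print(check_cleanup())
```

This only confirms that the name is gone in the child. Whether the reported memory actually drops depends on the allocator and on how your monitoring tool counts pages shared with the parent.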


answered Nov 15 '22 18:11

dano
