I've got a large dict-like object that needs to be shared between a number of worker processes. Each worker reads a random subset of the information in the object and does some computation with it. I'd like to avoid copying the large object as my machine quickly runs out of memory.
I was playing with the code for this SO question and I modified it a bit to use a fixed-size process pool, which is better suited to my use case. This however seems to break it.
from multiprocessing import Process, Pool
from multiprocessing.managers import BaseManager

class numeri(object):
    def __init__(self):
        self.nl = []

    def getLen(self):
        return len(self.nl)

    def stampa(self):
        print self.nl

    def appendi(self, x):
        self.nl.append(x)

    def svuota(self):
        for i in range(len(self.nl)):
            del self.nl[0]

class numManager(BaseManager):
    pass

def produce(listaNumeri):
    print 'producing', id(listaNumeri)
    return id(listaNumeri)

def main():
    numManager.register('numeri', numeri,
                        exposed=['getLen', 'appendi', 'svuota', 'stampa'])
    mymanager = numManager()
    mymanager.start()
    listaNumeri = mymanager.numeri()
    print id(listaNumeri)

    print '------------ Process'
    for i in range(5):
        producer = Process(target=produce, args=(listaNumeri,))
        producer.start()
        producer.join()

    print '--------------- Pool'
    pool = Pool(processes=1)
    for i in range(5):
        pool.apply_async(produce, args=(listaNumeri,)).get()

if __name__ == '__main__':
    main()
The output is
4315705168
------------ Process
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
--------------- Pool
producing 4299771152
producing 4315861712
producing 4299771152
producing 4315861712
producing 4299771152
As you can see, in the first case all worker processes get the same object (by id). In the second case, the ids are not the same. Does that mean the object is being copied?
P.S. I don't think that matters, but I am using joblib, which internally uses a Pool:
from joblib import delayed, Parallel
print '------------- Joblib'
Parallel(n_jobs=4)(delayed(produce)(listaNumeri) for i in range(5))
which outputs:
------------- Joblib
producing 4315862096
producing 4315862288
producing 4315862480
producing 4315862672
producing 4315862352
I'm afraid virtually nothing here works the way you hope it works :-(
First note that identical id() values produced by different processes tell you nothing about whether the objects are really the same object. Each process has its own virtual address space, assigned by the operating system. The same virtual address in two processes can refer to entirely different physical memory locations. Whether your code produces the same id() output or not is pretty much purely accidental. Across multiple runs, sometimes I see different id() output in your Process section and repeated id() output in your Pool section, or vice versa, or both.
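For instance, a minimal sketch along these lines (assuming the fork() start method, the default on Linux; the data name is purely illustrative) shows a child reporting exactly the same id() as its parent even though it is mutating its own private copy of the list:

import os
from multiprocessing import Process

data = [1, 2, 3]  # any module-level object will do

def child():
    data.append(99)  # mutates only the child's copy of the list
    print("child :", os.getpid(), id(data), data)

if __name__ == '__main__':
    p = Process(target=child)
    p.start()
    p.join()
    # Typically the same id() the child printed, yet the parent's list is unchanged.
    print("parent:", os.getpid(), id(data), data)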
Second, a Manager supplies semantic sharing but not physical sharing. The data for your numeri instance lives only in the manager process. All your worker processes see (copies of) proxy objects. Those are thin wrappers that forward all operations to be performed by the manager process. This involves lots of inter-process communication, and serialization inside the manager process. This is a great way to write really slow code ;-) Yes, there is only one copy of the numeri data, but all work on it is done by a single process (the manager process).
To see this more clearly, make the changes @martineau suggested, and also change get_list_id() to this:

def get_list_id(self):  # added method
    import os
    print("get_list_id() running in process", os.getpid())
    return id(self.nl)
Here's sample output:
41543664
------------ Process
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 46268496
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 44153904
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
--------------- Pool
producing 41639248
get_list_id() running in process 5856
with list_id 44544608
producing 41777200
get_list_id() running in process 5856
with list_id 44544608
producing 41776816
get_list_id() running in process 5856
with list_id 44544608
producing 41777168
get_list_id() running in process 5856
with list_id 44544608
producing 41777136
get_list_id() running in process 5856
with list_id 44544608
Clear? The reason you get the same list id each time is not because each worker process has the same self.nl member, it's because all numeri methods run in a single process (the manager process). That's why the list id is always the same.
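Another small sketch makes the same point from the worker's side (the Shared, MyManager, and inspect names here are purely illustrative): what a pool worker actually receives is a thin proxy object, not your data, and every method call on it is forwarded to the manager process:

from multiprocessing import Pool
from multiprocessing.managers import BaseManager

class Shared(object):  # illustrative stand-in for numeri
    def __init__(self):
        self.nl = list(range(10))
    def get_len(self):
        return len(self.nl)

class MyManager(BaseManager):
    pass

MyManager.register('Shared', Shared, exposed=['get_len'])

def inspect(proxy):
    # The worker holds something like an AutoProxy[Shared], not a Shared;
    # the get_len() call below travels back to the manager process.
    return type(proxy).__name__, proxy.get_len()

if __name__ == '__main__':
    mgr = MyManager()
    mgr.start()
    shared = mgr.Shared()
    pool = Pool(processes=2)
    print(pool.apply(inspect, (shared,)))  # e.g. ('AutoProxy[Shared]', 10)
    pool.close()
    pool.join()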
If you're running on a Linux-y system (an OS that supports fork()), a much better idea is to forget all this Manager stuff and create your complex object at module level before starting any worker processes. Then the workers will inherit (address-space copies of) your complex object. The usual copy-on-write fork() semantics will make that about as memory-efficient as possible. That's sufficient if mutations don't need to be folded back into the main program's copy of the complex object. If mutations do need to be folded back in, then you're back to needing lots of inter-process communication, and multiprocessing becomes correspondingly less attractive.
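A rough sketch of that approach (the big_dict and worker names are purely illustrative, and it relies on fork() so the workers inherit the parent's memory instead of rebuilding or receiving the object):

from multiprocessing import Pool
import random

# Built once, before any workers start; with fork(), children share these
# pages copy-on-write instead of receiving a pickled copy.
big_dict = {i: i * i for i in range(10**6)}

def worker(keys):
    # Reads the inherited big_dict directly -- nothing goes through a manager
    # or a queue; any mutation here would stay local to this worker.
    return sum(big_dict[k] for k in keys)

if __name__ == '__main__':
    pool = Pool(processes=4)
    jobs = [random.sample(range(10**6), 1000) for _ in range(8)]
    print(pool.map(worker, jobs))  # each worker reads its random subset
    pool.close()
    pool.join()

Because the workers only read big_dict, the copy-on-write pages are largely never duplicated; only the (small) argument lists and results are pickled between processes.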
There are no easy answers here. Don't shoot the messenger ;-)