I've got a large dict-like object that needs to be shared between a number of worker processes. Each worker reads a random subset of the information in the object and does some computation with it. I'd like to avoid copying the large object as my machine quickly runs out of memory.
I was playing with the code for this SO question and I modified it a bit to use a fixed-size process pool, which is better suited to my use case. This however seems to break it.
from multiprocessing import Process, Pool
from multiprocessing.managers import BaseManager

class numeri(object):
    def __init__(self):
        self.nl = []

    def getLen(self):
        return len(self.nl)

    def stampa(self):
        print self.nl

    def appendi(self, x):
        self.nl.append(x)

    def svuota(self):
        for i in range(len(self.nl)):
            del self.nl[0]

class numManager(BaseManager):
    pass

def produce(listaNumeri):
    print 'producing', id(listaNumeri)
    return id(listaNumeri)

def main():
    numManager.register('numeri', numeri,
                        exposed=['getLen', 'appendi', 'svuota', 'stampa'])
    mymanager = numManager()
    mymanager.start()
    listaNumeri = mymanager.numeri()
    print id(listaNumeri)

    print '------------ Process'
    for i in range(5):
        producer = Process(target=produce, args=(listaNumeri,))
        producer.start()
        producer.join()

    print '--------------- Pool'
    pool = Pool(processes=1)
    for i in range(5):
        pool.apply_async(produce, args=(listaNumeri,)).get()

if __name__ == '__main__':
    main()
The output is
4315705168
------------ Process
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
--------------- Pool
producing 4299771152
producing 4315861712
producing 4299771152
producing 4315861712
producing 4299771152
As you can see, in the first case all worker processes get the same object (by id). In the second case, the ids are not the same. Does that mean the object is being copied?
P.S. I don't think that matters, but I am using joblib, which internally uses a Pool:
from joblib import delayed, Parallel
print '------------- Joblib'
Parallel(n_jobs=4)(delayed(produce)(listaNumeri) for i in range(5))
which outputs:
------------- Joblib
producing 4315862096
producing 4315862288
producing 4315862480
producing 4315862672
producing 4315862352
I'm afraid virtually nothing here works the way you hope it works :-(
First note that identical id() values produced by different processes tell you nothing about whether the objects are really the same object. Each process has its own virtual address space, assigned by the operating system. The same virtual address in two processes can refer to entirely different physical memory locations. Whether your code produces the same id() output or not is pretty much purely accidental. Across multiple runs, sometimes I see different id() output in your Process section and repeated id() output in your Pool section, or vice versa, or both.
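For instance, a minimal sketch along these lines (assuming the fork() start method, the default on Linux; the data name is purely illustrative) shows a child reporting exactly the same id() as its parent even though it is mutating its own private copy of the list:

import os
from multiprocessing import Process

data = [1, 2, 3]  # any module-level object will do

def child():
    data.append(99)  # mutates only the child's copy of the list
    print("child :", os.getpid(), id(data), data)

if __name__ == '__main__':
    p = Process(target=child)
    p.start()
    p.join()
    # Typically the same id() the child printed, yet the parent's list is unchanged.
    print("parent:", os.getpid(), id(data), data)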
Second, a Manager supplies semantic sharing but not physical sharing. The data for your numeri instance lives only in the manager process. All your worker processes see (copies of) proxy objects. Those are thin wrappers that forward all operations to be performed by the manager process. This involves lots of inter-process communication, and serialization inside the manager process. This is a great way to write really slow code ;-) Yes, there is only one copy of the numeri data, but all work on it is done by a single process (the manager process).
To see this more clearly, make the changes @martineau suggested, and also change get_list_id() to this:

def get_list_id(self):  # added method
    import os
    print("get_list_id() running in process", os.getpid())
    return id(self.nl)
Here's sample output:
41543664
------------ Process
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 46268496
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 44153904
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
--------------- Pool
producing 41639248
get_list_id() running in process 5856
with list_id 44544608
producing 41777200
get_list_id() running in process 5856
with list_id 44544608
producing 41776816
get_list_id() running in process 5856
with list_id 44544608
producing 41777168
get_list_id() running in process 5856
with list_id 44544608
producing 41777136
get_list_id() running in process 5856
with list_id 44544608
Clear? The reason you get the same list id each time is not because each worker process has the same self.nl member, it's because all numeri methods run in a single process (the manager process). That's why the list id is always the same.
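Another small sketch makes the same point from the worker's side (the Shared, MyManager, and inspect names here are purely illustrative): what a pool worker actually receives is a thin proxy object, not your data, and every method call on it is forwarded to the manager process:

from multiprocessing import Pool
from multiprocessing.managers import BaseManager

class Shared(object):  # illustrative stand-in for numeri
    def __init__(self):
        self.nl = list(range(10))
    def get_len(self):
        return len(self.nl)

class MyManager(BaseManager):
    pass

MyManager.register('Shared', Shared, exposed=['get_len'])

def inspect(proxy):
    # The worker holds something like an AutoProxy[Shared], not a Shared;
    # the get_len() call below travels back to the manager process.
    return type(proxy).__name__, proxy.get_len()

if __name__ == '__main__':
    mgr = MyManager()
    mgr.start()
    shared = mgr.Shared()
    pool = Pool(processes=2)
    print(pool.apply(inspect, (shared,)))  # e.g. ('AutoProxy[Shared]', 10)
    pool.close()
    pool.join()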
If you're running on a Linux-y system (an OS that supports fork()), a much better idea is to forget all this Manager stuff and create your complex object at module level before starting any worker processes. Then the workers will inherit (address-space copies of) your complex object. The usual copy-on-write fork() semantics will make that about as memory-efficient as possible. That's sufficient if mutations don't need to be folded back into the main program's copy of the complex object. If mutations do need to be folded back in, then you're back to needing lots of inter-process communication, and multiprocessing becomes correspondingly less attractive.
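A rough sketch of that approach (the big_dict and worker names are purely illustrative, and it relies on fork() so the workers inherit the parent's memory instead of rebuilding or receiving the object):

from multiprocessing import Pool
import random

# Built once, before any workers start; with fork(), children share these
# pages copy-on-write instead of receiving a pickled copy.
big_dict = {i: i * i for i in range(10**6)}

def worker(keys):
    # Reads the inherited big_dict directly -- nothing goes through a manager
    # or a queue; any mutation here would stay local to this worker.
    return sum(big_dict[k] for k in keys)

if __name__ == '__main__':
    pool = Pool(processes=4)
    jobs = [random.sample(range(10**6), 1000) for _ in range(8)]
    print(pool.map(worker, jobs))  # each worker reads its random subset
    pool.close()
    pool.join()

Because the workers only read big_dict, the copy-on-write pages are largely never duplicated; only the (small) argument lists and results are pickled between processes.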
There are no easy answers here. Don't shoot the messenger ;-)