 

python parallel map (multiprocessing.Pool.map) with global data

I'm trying to call a function on multiple processes. The obvious solution is Python's multiprocessing module. The problem is that the function has side effects: it creates a temporary file and registers that file to be deleted at exit using atexit.register and a global list. The following should demonstrate the problem (in a different context).

import multiprocessing as multi

glob_data = []

def func(a):
    glob_data.append(a)

if __name__ == '__main__':
    list(map(func, range(10)))  # list() forces the lazy map to run func
    print(glob_data)  # [0, 1, 2, ..., 9]  Good.

    p = multi.Pool(processes=8)
    p.map(func, range(80))

    print(glob_data)  # still [0, 1, 2, ..., 9]  Bad: glob_data wasn't updated.

Is there any way to have the global data updated?

Note that you probably shouldn't run the above from the interactive interpreter, since multiprocessing requires the module __main__ to be importable by the child processes (hence the __name__ == '__main__' guard).

UPDATE

Adding the global keyword in func doesn't help -- each worker process still appends to its own copy of the list (see the sketch after this snippet):

def func(a):  # still doesn't work
    global glob_data
    glob_data.append(a)  # the append happens in the worker's own copy
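
A way to confirm this (a sketch of mine, not part of the original question): have each call return the worker's PID together with a snapshot of the worker's own copy of the list. The appends do happen, just inside each worker process's private list:

import multiprocessing as multi
import os

glob_data = []

def func(a):
    global glob_data
    glob_data.append(a)
    # Snapshot of this worker's private copy, plus the worker's PID.
    return (os.getpid(), list(glob_data))

if __name__ == '__main__':
    p = multi.Pool(processes=4)
    for pid, snapshot in p.map(func, range(8)):
        print(pid, snapshot)  # worker PIDs, each with a non-empty list
    print(glob_data)          # still [] in the parent
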
asked Mar 28 '12 by mgilson


1 Answer

You need the list glob_data to be shared between the processes; multiprocessing's Manager gives you just that:

import multiprocessing as multi
from multiprocessing import Manager

def func(a):
    glob_data.append(a)

if __name__ == '__main__':
    manager = Manager()

    # A proxy list: the real list lives in the manager's server process,
    # and appends from any process go through it.
    glob_data = manager.list()

    list(map(func, range(10)))  # the ordinary in-process map still works
    print(list(glob_data))      # [0, 1, 2, ..., 9]  Good.

    p = multi.Pool(processes=8)
    p.map(func, range(80))

    print(list(glob_data))      # all 90 items.  Super good.
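
One caveat worth adding (mine, not from the original answer): the snippet above relies on the fork start method, the default on Linux, where workers inherit glob_data. Under the spawn start method (the default on Windows and on recent macOS), workers re-import the module and won't see it, so it's more portable to hand the proxy to each worker through the pool's initializer. A sketch:

import multiprocessing as multi
from multiprocessing import Manager

def init_worker(shared):
    # Runs once in every worker: bind the manager proxy to a global
    # so func can reach it regardless of the start method.
    global glob_data
    glob_data = shared

def func(a):
    glob_data.append(a)

if __name__ == '__main__':
    manager = Manager()
    glob_data = manager.list()
    p = multi.Pool(processes=8, initializer=init_worker,
                   initargs=(glob_data,))
    p.map(func, range(80))
    print(list(glob_data))  # all 80 items (append order may vary)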

For some background:

https://docs.python.org/3/library/multiprocessing.html#managers
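
To tie this back to the temp-file use case in the question: an alternative that avoids shared state entirely (my suggestion, not from the answer) is to have the worker return the path it created and register a single atexit cleanup in the parent, since atexit handlers registered inside pool workers may never run (workers typically exit via os._exit). A sketch with hypothetical make_temp and cleanup helpers:

import atexit
import multiprocessing as multi
import os
import tempfile

def make_temp(i):
    # Create the temp file in the worker, but only return its path;
    # the cleanup bookkeeping stays in the parent process.
    fd, path = tempfile.mkstemp(suffix='-%d.tmp' % i)
    os.close(fd)
    return path

def cleanup(paths):
    for path in paths:
        if os.path.exists(path):
            os.remove(path)

if __name__ == '__main__':
    p = multi.Pool(processes=4)
    temp_files = p.map(make_temp, range(8))
    atexit.register(cleanup, temp_files)  # registered once, in the parent
    print(temp_files)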

answered Sep 22 '22 by Rafael Ferreira