 

Is shared readonly data copied to different processes for multiprocessing?

The piece of code that I have looks somewhat like this:

from multiprocessing import Pool

glbl_array = # a 3 Gb array

def my_func(args, def_param=glbl_array):
    # do stuff on args and def_param

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(my_func, range(1000))

Is there a way to make sure (or encourage) that the different processes do not get a copy of glbl_array but share it? If there is no way to stop the copy, I will go with a memmapped array, but my access patterns are not very regular, so I expect memmapped arrays to be slower. The above seemed like the first thing to try. This is on Linux. I just wanted some advice from Stack Overflow and do not want to annoy the sysadmin. Do you think it will help if the second parameter is a genuine immutable object like glbl_array.tostring()?

asked Apr 05 '11 by san


2 Answers

You can use the shared memory stuff from multiprocessing together with Numpy fairly easily:

import multiprocessing
import ctypes
import numpy as np

shared_array_base = multiprocessing.Array(ctypes.c_double, 10*10)
shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
shared_array = shared_array.reshape(10, 10)

#-- edited 2015-05-01: the assert check below checks the wrong thing
#   with recent versions of Numpy/multiprocessing. That no copy is made
#   is indicated by the fact that the program prints the output shown below.
## No copy was made
##assert shared_array.base.base is shared_array_base.get_obj()

# Parallel processing
def my_func(i, def_param=shared_array):
    shared_array[i, :] = i

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(my_func, range(10))

    print(shared_array)

which prints

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 3.  3.  3.  3.  3.  3.  3.  3.  3.  3.]
 [ 4.  4.  4.  4.  4.  4.  4.  4.  4.  4.]
 [ 5.  5.  5.  5.  5.  5.  5.  5.  5.  5.]
 [ 6.  6.  6.  6.  6.  6.  6.  6.  6.  6.]
 [ 7.  7.  7.  7.  7.  7.  7.  7.  7.  7.]
 [ 8.  8.  8.  8.  8.  8.  8.  8.  8.  8.]
 [ 9.  9.  9.  9.  9.  9.  9.  9.  9.  9.]]

However, Linux has copy-on-write semantics on fork(), so even without using multiprocessing.Array, the data will not be copied unless it is written to.
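To illustrate, here is a minimal sketch of relying on copy-on-write alone, under the assumption that the fork start method is used (the default on Linux). The names glbl_array and my_func mirror the question, and the array is scaled down from 3 GB for the example:

import multiprocessing
import numpy as np

# Stand-in for the real 3 GB array; defined at module level so that
# forked workers inherit it rather than receiving a pickled copy.
glbl_array = np.zeros(10**6)

def my_func(i):
    # Reading the inherited array does not copy it: with fork()'s
    # copy-on-write, pages are only duplicated if a process writes to them.
    return glbl_array[i] + i

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)  # uses fork on Linux
    print(pool.map(my_func, range(10)))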

answered Sep 21 '22 by pv.


The following code works on Win7 and Mac (it may work on Linux as well, but has not been tested there):

import multiprocessing
import ctypes
import numpy as np

shared_array = None

def init(shared_array_base):
    global shared_array
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(10, 10)

# Parallel processing
def my_func(i):
    shared_array[i, :] = i

if __name__ == '__main__':
    shared_array_base = multiprocessing.Array(ctypes.c_double, 10*10)

    pool = multiprocessing.Pool(processes=4, initializer=init,
                                initargs=(shared_array_base,))
    pool.map(my_func, range(10))

    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(10, 10)
    print(shared_array)
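The key difference from the first answer is that the shared Array is handed to each worker through the pool's initializer rather than inherited as a module-level global. This matters with the spawn start method (the default on Windows, and on macOS since Python 3.8), where workers start as fresh interpreters that do not inherit the parent's globals; initargs are delivered once per worker at startup, which is the supported way to hand over a multiprocessing.Array, since synchronized wrappers cannot simply be passed as pool.map arguments.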
answered Sep 21 '22 by taku-y