
Shared memory in multiprocessing

I have three large lists. The first contains bitarrays (module bitarray 0.8.0) and the other two contain arrays of integers.

l1 = [bitarray 1, bitarray 2, ..., bitarray n]
l2 = [array 1, array 2, ..., array n]
l3 = [array 1, array 2, ..., array n]

These data structures take quite a bit of RAM (~16GB total).

If I start 12 sub-processes using:

multiprocessing.Process(target=someFunction, args=(l1,l2,l3)) 

Does this mean that l1, l2 and l3 will be copied for each sub-process or will the sub-processes share these lists? Or to be more direct, will I use 16GB or 192GB of RAM?

someFunction will read some values from these lists and then perform some calculations based on the values read. The results will be returned to the parent process. The lists l1, l2 and l3 will not be modified by someFunction.

Therefore I would assume that the sub-processes do not need to, and would not, copy these huge lists, but would instead just share them with the parent. That would mean the program takes 16GB of RAM (regardless of how many sub-processes I start) thanks to the copy-on-write approach under Linux. Am I correct, or am I missing something that would cause the lists to be copied?

EDIT: I am still confused after reading a bit more on the subject. On the one hand, Linux uses copy-on-write, which should mean that no data is copied. On the other hand, accessing an object will change its ref-count (I am still unsure why, and what that means). Even so, will the entire object be copied?
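To make the ref-count remark concrete (a small illustration, not part of the program above): in CPython the reference count is stored inside the object itself, so even a purely read-only access writes to the memory page holding that object, and under copy-on-write such a write forces the kernel to duplicate the page.

import sys

x = [1, 2, 3]
print(sys.getrefcount(x))  # e.g. 2: the name x plus the temporary argument reference
y = x                      # binding another name updates the counter stored inside the list object
print(sys.getrefcount(x))  # now 3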

For example, if I define someFunction as follows:

def someFunction(list1, list2, list3):
    i = random.randint(0, 99999)
    print list1[i], list2[i], list3[i]

Would using this function mean that l1, l2 and l3 will be copied entirely for each sub-process?

Is there a way to check for this?
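One way to check, on Linux, is to print each sub-process's resident set size before and after it touches the lists; if it climbs towards the size of l1, l2 and l3, the pages are being copied. A minimal sketch (the helper rss_mb is illustrative, not part of the program above):

import os

def rss_mb():
    # Resident set size of the calling process, in MB (Linux-specific, read from /proc)
    with open('/proc/%d/status' % os.getpid()) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) // 1024  # the value in the file is in kB
    return -1

Calling rss_mb() at the start and at the end of someFunction in each sub-process, and comparing with the parent, shows whether the 16GB is being duplicated. Tools like top or smem (the USS/PSS columns) give the same information without modifying the code.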

EDIT2: After reading a bit more and monitoring the total memory usage of the system while the sub-processes are running, it seems that entire objects are indeed copied for each sub-process, and it seems to be because of reference counting.

The reference counting for l1, l2 and l3 is actually unneeded in my program. This is because l1, l2 and l3 will be kept in memory (unchanged) until the parent process exits. There is no need to free the memory used by these lists until then. In fact I know for sure that the reference count will remain above 0 (for these lists and every object in these lists) until the program exits.

So now the question becomes: how can I make sure that the objects will not be copied to each sub-process? Can I perhaps disable reference counting for these lists and each object in these lists?
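CPython offers no supported way to turn reference counting off, but since Python 3.7 gc.freeze() can at least stop the cyclic garbage collector from writing into object headers after the lists have been built, which removes one source of copy-on-write faults (refcount updates on access will still dirty pages). A sketch under that assumption:

import gc

# after l1, l2 and l3 have been fully built, and before any sub-process is started:
gc.disable()  # keep a collection from running while the children are forked
gc.freeze()   # Python 3.7+: move every existing object into a permanent generation,
              # so the collector never writes into their headers again
# start the multiprocessing.Process workers as before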

EDIT3: Just an additional note. Sub-processes do not need to modify l1, l2 and l3 or any objects in these lists. The sub-processes only need to be able to reference some of these objects without causing the memory to be copied for each sub-process.


2 Answers

Generally speaking, there are two ways to share the same data:

  • Multithreading
  • Shared memory

Python's multithreading is not suitable for CPU-bound tasks (because of the GIL), so the usual solution in that case is to use multiprocessing. However, with this solution you need to explicitly share the data, using multiprocessing.Value and multiprocessing.Array.

Note that sharing data between processes is usually not the best choice, because of all the synchronization issues; an approach involving actors exchanging messages is usually seen as a better choice. See also the Python documentation:

As mentioned above, when doing concurrent programming it is usually best to avoid using shared state as far as possible. This is particularly true when using multiple processes.

However, if you really do need to use some shared data then multiprocessing provides a couple of ways of doing so.

In your case, you need to wrap l1, l2 and l3 in some way understandable by multiprocessing (e.g. by using a multiprocessing.Array), and then pass them as parameters.
Note also that, since you said you do not need write access, you should pass lock=False while creating the objects, or all access will still be serialized.
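For the two integer lists that could look roughly like the sketch below (illustrative names and data, not the poster's; 'q' is the type code for a signed 64-bit integer). With lock=False the children get plain read access to one shared block instead of private copies:

from multiprocessing import Process, Array
import random

def read_values(shared1, shared2):
    # Each child receives a handle to the same shared ctypes arrays; nothing is duplicated.
    i = random.randint(0, len(shared1) - 1)
    print(shared1[i], shared2[i])

if __name__ == '__main__':
    a1 = Array('q', [10, 20, 30, 40], lock=False)  # lock=False: reads need no serialization
    a2 = Array('q', [1, 2, 3, 4], lock=False)

    workers = [Process(target=read_values, args=(a1, a2)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()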



Because this is still a very high result on Google and no one else has mentioned it yet, I thought I would mention the possibility of 'true' shared memory, which was introduced in Python 3.8: https://docs.python.org/3/library/multiprocessing.shared_memory.html

I have included a small contrived example here (tested on Linux) where NumPy arrays are used, which is likely a very common use case:

# one dimension of the 2d array which is shared
dim = 5000

import numpy as np
from multiprocessing import shared_memory, Process, Lock
from multiprocessing import cpu_count, current_process
import time

lock = Lock()

def add_one(shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_array = np.ndarray((dim, dim,), dtype=np.int64, buffer=existing_shm.buf)
    lock.acquire()
    np_array[:] = np_array[0] + 1
    lock.release()
    time.sleep(10)  # pause, to see the memory usage in top
    print('added one')
    existing_shm.close()

def create_shared_block():
    a = np.ones(shape=(dim, dim), dtype=np.int64)  # Start with an existing NumPy array

    shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
    # Now create a NumPy array backed by shared memory
    np_array = np.ndarray(a.shape, dtype=np.int64, buffer=shm.buf)
    np_array[:] = a[:]  # Copy the original data into shared memory
    return shm, np_array

if current_process().name == "MainProcess":
    print("creating shared block")
    shr, np_array = create_shared_block()

    processes = []
    for i in range(cpu_count()):
        _process = Process(target=add_one, args=(shr.name,))
        processes.append(_process)
        _process.start()

    for _process in processes:
        _process.join()

    print("Final array")
    print(np_array[:10])
    print(np_array[10:])

    shr.close()
    shr.unlink()

Note that because of the 64-bit ints this code can take about 1 GB of RAM to run, so make sure that you won't freeze your system using it. ^_^
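Since the lists in the question are only read, the same approach works without the Lock: the parent builds the block once and each child attaches to it by name. A rough read-only sketch (the array contents are placeholders for one of the integer arrays):

import numpy as np
from multiprocessing import Process, shared_memory

def worker(block_name, length):
    shm = shared_memory.SharedMemory(name=block_name)       # attach by name, no copy
    arr = np.ndarray((length,), dtype=np.int64, buffer=shm.buf)
    print(arr[:5].sum())                                    # read-only use
    shm.close()

if __name__ == '__main__':
    data = np.arange(1_000_000, dtype=np.int64)             # stand-in for one integer array
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data                                        # the only copy: parent -> shared block

    procs = [Process(target=worker, args=(shm.name, len(data))) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    shm.close()
    shm.unlink()                                            # release the block once everyone is done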
