I have a very large (read-only) array of data that I want to be processed by multiple processes in parallel.
I like the Pool.map function and would like to use it to calculate functions on that data in parallel.
I saw that one can use the Value or Array class to share data in memory between processes. But when I try to use this I get RuntimeError: 'SynchronizedString objects should only be shared between processes through inheritance' when using the Pool.map function.
Here is a simplified example of what I am trying to do:
from sys import stdin
from multiprocessing import Pool, Array

def count_it( arr, key ):
    count = 0
    for c in arr:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    # want to share it using shared memory
    toShare = Array('c', testData)

    # this works
    print count_it( toShare, "a" )

    pool = Pool()

    # RuntimeError here
    print pool.map( count_it, [(toShare,key) for key in ["a", "b", "s", "d"]] )
Can anyone tell me what I am doing wrong here?
So what I would like to do is pass information about a newly created shared-memory array to the processes after they have been created in the process pool.
Two notes from the multiprocessing documentation are relevant here. multiprocessing.Manager() returns a started SyncManager object which can be used for sharing objects between processes; the returned manager object corresponds to a spawned child process and has methods which will create shared objects and return corresponding proxies.
For synchronized wrappers, if lock is None (the default) then a multiprocessing.RLock object is created automatically; a synchronized wrapper will have two methods in addition to those of the object it wraps: get_obj() returns the wrapped object and get_lock() returns the lock object used for synchronization.
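To make those two wrapper methods concrete, here is a small sketch (the placeholder string is mine) of get_obj() and get_lock() on an Array created with the default lock=True:

from multiprocessing import Array

shared = Array('c', b"some read-only data")  # lock=True by default

with shared.get_lock():       # the RLock guarding the wrapped object
    raw = shared.get_obj()    # the underlying ctypes char array
    print(raw.value)          # b'some read-only data'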
If the data is read-only, just make it a variable in a module before the fork from Pool. Then all the child processes should be able to access it, and it won't be copied provided you don't write to it.
import myglobals  # anything (an empty .py file works)
from multiprocessing import Pool

myglobals.data = []

def count_it( key ):
    count = 0
    for c in myglobals.data:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    myglobals.data = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"

    pool = Pool()
    print pool.map( count_it, ["a", "b", "s", "d"] )
If you do want to try to use Array, though, you could try it with the lock=False keyword argument (it is True by default).
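A minimal sketch of that option (reusing the question's test string); with lock=False, Array returns a plain ctypes array with no synchronized wrapper, so there is no get_lock()/get_obj() and no locking overhead, which is fine for data that is only ever read:

from multiprocessing import Array

toShare = Array('c', b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf",
                lock=False)
print(toShare[:7])  # b'abcabcs' under Python 3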
The problem I see is that Pool doesn't support pickling shared data through its argument list. That's what the error message means by "objects should only be shared between processes through inheritance". The shared data needs to be inherited, i.e., global if you want to share it using the Pool class.
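(As an aside, not part of this answer: one common way to satisfy the inheritance requirement while still using Pool is to hand the shared Array to each worker when it starts, via Pool's initializer/initargs, since shared objects may be passed at process-creation time, just not through map's argument list. A sketch, with the init_worker helper and the bytes keys being my own adaptation written for Python 3:)

from multiprocessing import Pool, Array

toShare = None  # filled in by init_worker inside each pool worker

def init_worker(shared_arr):
    # Runs once per worker at start-up; the shared array arrives here
    # legitimately because the workers are being created ("inheritance").
    global toShare
    toShare = shared_arr

def count_it(key):
    return (key, sum(1 for c in toShare if c == key))

if __name__ == '__main__':
    testData = b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    shared = Array('c', testData)
    pool = Pool(initializer=init_worker, initargs=(shared,))
    print(pool.map(count_it, [b"a", b"b", b"s", b"d"]))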
If you need to pass them explicitly, you may have to use multiprocessing.Process. Here is your reworked example:
from multiprocessing import Process, Array, Queue

def count_it( q, arr, key ):
    count = 0
    for c in arr:
        if c == key:
            count += 1
    q.put((key, count))

if __name__ == '__main__':
    testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    # want to share it using shared memory
    toShare = Array('c', testData)

    q = Queue()
    keys = ['a', 'b', 's', 'd']
    workers = [Process(target=count_it, args = (q, toShare, key))
               for key in keys]

    for p in workers:
        p.start()
    for p in workers:
        p.join()
    while not q.empty():
        print q.get(),
Output: ('s', 9) ('a', 2) ('b', 3) ('d', 12)
The ordering of elements of the queue may vary.
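If you need the results keyed rather than in arrival order, you can drain the queue into a dict instead; continuing from the example above:

results = {}
while not q.empty():
    key, count = q.get()
    results[key] = count
print(results)  # e.g. {'s': 9, 'a': 2, 'b': 3, 'd': 12}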
To make this more generic and similar to Pool, you could create a fixed number N of Processes, split the list of keys into N pieces, and then use a wrapper function as the Process target, which calls count_it for each key in the list it is passed, like:
def wrapper( q, arr, keys ):
    for k in keys:
        count_it(q, arr, k)
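A self-contained sketch of that generalization; the chunk() helper and the value of N are mine, and the keys are bytes so the comparison also works under Python 3:

from multiprocessing import Process, Array, Queue

def count_it(q, arr, key):
    q.put((key, sum(1 for c in arr if c == key)))

def wrapper(q, arr, keys):
    for k in keys:
        count_it(q, arr, k)

def chunk(lst, n):
    # split lst into n roughly equal pieces
    k, m = divmod(len(lst), n)
    return [lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

if __name__ == '__main__':
    toShare = Array('c', b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf")
    q = Queue()
    keys = [b'a', b'b', b's', b'd']
    N = 2  # fixed number of worker processes, chosen arbitrarily here
    workers = [Process(target=wrapper, args=(q, toShare, piece))
               for piece in chunk(keys, N)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    while not q.empty():
        print(q.get())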
If you're seeing:
RuntimeError: Synchronized objects should only be shared between processes through inheritance
consider using multiprocessing.Manager instead, as it doesn't have this limitation. The Manager works because it runs in a separate process altogether and hands out picklable proxies to the shared objects.
import ctypes
import multiprocessing

# Put this in a method or function, otherwise it will run on import from each module:
manager = multiprocessing.Manager()
counter = manager.Value(ctypes.c_ulonglong, 0)
counter_lock = manager.Lock()  # pylint: disable=no-member

with counter_lock:
    counter.value = count = counter.value + 1
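To tie this back to Pool: because the Manager's proxies are picklable, they can go straight through map's argument list, unlike the synchronized Array from the question. A sketch (the worker function is mine):

import multiprocessing

def worker(args):
    counter, counter_lock = args   # proxies arrive through map's arguments
    with counter_lock:
        counter.value += 1

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    counter = manager.Value('i', 0)
    counter_lock = manager.Lock()
    pool = multiprocessing.Pool()
    pool.map(worker, [(counter, counter_lock)] * 10)
    pool.close()
    pool.join()
    print(counter.value)  # 10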