For this question, I refer to the example in Python docs discussing the "use of the SharedMemory class with NumPy arrays, accessing the same numpy.ndarray from two distinct Python shells".
A major change that I'd like to implement is to manipulate an array of class objects rather than integer values, as I demonstrate below.
import numpy as np
from multiprocessing import shared_memory    
# a simplistic class example
class A(): 
    def __init__(self, x): 
        self.x = x
# numpy array of class objects 
a = np.array([A(1), A(2), A(3)])       
# create a shared memory instance
shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='psm_test0')
# numpy array backed by shared memory
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)                                    
# copy the original data into shared memory
b[:] = a[:]                                  
print(b)                                            
# array([<__main__.A object at 0x7fac56cd1190>,
#       <__main__.A object at 0x7fac56cd1970>,
#       <__main__.A object at 0x7fac56cd19a0>], dtype=object)
Now, in a different shell, we attach to the shared memory space and try to manipulate the contents of the array.
import numpy as np
from multiprocessing import shared_memory
# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')
c = np.ndarray((3,), dtype=object, buffer=existing_shm.buf)
Even before we are able to manipulate c, merely printing it results in a segmentation fault. This is not surprising: an object-dtype array only stores references (pointers) to Python objects that live in the first interpreter's memory, and those addresses are meaningless in the second process. Indeed, I cannot expect to observe behaviour that has not been written into the module, so my question is: what can I do to work with a shared array of objects?
I'm currently pickling the list, but the protected reads/writes add a fair bit of overhead. I've also tried using a Namespace, which was quite slow because indexed writes are not allowed. Another idea could be to use a shared ctypes Structure in a ShareableList, but I wouldn't know where to start with that.
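For reference, the pickling approach looks roughly like this (a simplified sketch; the block name 'psm_pickled', the fixed 1 MiB size, and the helper names are placeholders, and in practice the lock has to be handed to the worker processes):
import pickle
from multiprocessing import Lock, shared_memory

lock = Lock()  # in practice passed to every worker process
shm = shared_memory.SharedMemory(create=True, size=2**20, name='psm_pickled')

def write_list(objs):
    # serialize the whole list and copy it into the shared block
    data = pickle.dumps(objs)
    with lock:
        shm.buf[:len(data)] = data

def read_list():
    # pickle stops at its STOP opcode, so the oversized buffer is fine
    with lock:
        return pickle.loads(bytes(shm.buf))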
There is also a design aspect: it appears that there is an open bug in shared_memory that may affect my implementation, in which several processes work on different elements of the array.
Is there a more scalable way of sharing a large list of objects between several processes, so that at any given time each running process interacts with a distinct object/element in the list?
UPDATE: At this point, I will also accept partial answers that talk about whether this can be achieved with Python at all.
So, I did a bit of research (Shared Memory Objects in Multiprocessing) and came up with a few ideas:
Serialize the objects, then save them as byte strings to a numpy array. Problematic here is that
one needs to pass the data type from the creator of 'psm_test0' to any consumer of 'psm_test0'. This could be done with another shared memory block, though; see the sketch after the consumer code below.
pickling and unpickling is essentially like a deepcopy, i.e. it actually copies the underlying data.
The code for the 'main' process reads:
import pickle
from multiprocessing import shared_memory
import numpy as np
# a simplistic class example
class A():
    def __init__(self, x):
        self.x = x
    def pickle(self):
        return pickle.dumps(self)
    @classmethod
    def unpickle(cls, bts):
        return pickle.loads(bts)
if __name__ == '__main__':
    # Test pickling procedure
    a = A(1)
    print(A.unpickle(a.pickle()).x)
    # >>> 1
    # numpy array of byte strings
    a_arr = np.array([A(1).pickle(), A(2).pickle(), A('This is a really long test string which should exceed 42 bytes').pickle()])
    # create a shared memory instance
    shm = shared_memory.SharedMemory(
        create=True,
        size=a_arr.nbytes,
        name='psm_test0'
    )
    # numpy array backed by shared memory
    b_arr = np.ndarray(a_arr.shape, dtype=a_arr.dtype, buffer=shm.buf)
    # copy the original data into shared memory
    b_arr[:] = a_arr[:]
    print(b_arr.dtype)
    # 'S105'
and for the consumer
import numpy as np
from multiprocessing import shared_memory
# A is defined in the 'main' script above (assumed to be saved as test.py)
from test import A
# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')
c = np.ndarray((3,), dtype='S105', buffer=existing_shm.buf)
# Test data transfer
arr = [A.unpickle(bts).x for bts in c]
print(arr)
# [1, 2, ...]
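As noted above, the dtype could itself be passed through a second shared memory block instead of hardcoding 'S105' on the consumer side. A minimal sketch, assuming a block named 'psm_test0_dtype' with an arbitrary 64-byte buffer (the name and size are my own choices; the shape could be shared the same way):
import numpy as np
from multiprocessing import shared_memory
# producer side: store the array-protocol typestring, e.g. b'|S105'
dtype_bytes = b_arr.dtype.str.encode()
dt_shm = shared_memory.SharedMemory(create=True, size=64, name='psm_test0_dtype')
dt_shm.buf[:len(dtype_bytes)] = dtype_bytes
# consumer side: read the typestring back and build the array with it
dt_shm = shared_memory.SharedMemory(name='psm_test0_dtype')
dtype_str = bytes(dt_shm.buf).rstrip(b'\x00').decode()
c = np.ndarray((3,), dtype=np.dtype(dtype_str), buffer=existing_shm.buf)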
I'd say you have a few ways of going forward:
Stay with simple data types.
Implement something using the C API, but I can't really help you there.
Use Rust.
Use Managers. You may lose out on some performance (I'd like to see a real benchmark, though), but you get a relatively safe and simple interface for shared objects; see the sketch after this list.
Use Redis, which also has Python bindings...
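For the Manager option, a minimal sketch of the interface (my own example, not benchmarked; note that the list proxy hands out copies of the elements, so changes have to be written back explicitly):
from multiprocessing import Manager, Process

class A:
    def __init__(self, x):
        self.x = x

def worker(shared, i):
    # the list proxy returns a copy, so mutate and write back
    obj = shared[i]
    obj.x += 10
    shared[i] = obj

if __name__ == '__main__':
    with Manager() as mgr:
        shared = mgr.list([A(1), A(2), A(3)])
        procs = [Process(target=worker, args=(shared, i)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print([a.x for a in shared])
        # [11, 12, 13]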