For this question, I refer to the example in Python docs discussing the "use of the SharedMemory class with NumPy arrays, accessing the same numpy.ndarray from two distinct Python shells".
A major change that I'd like to implement is to manipulate an array of class objects rather than integer values, as I demonstrate below.
import numpy as np
from multiprocessing import shared_memory    
# a simplistic class example
class A(): 
    def __init__(self, x): 
        self.x = x
# numpy array of class objects 
a = np.array([A(1), A(2), A(3)])       
# create a shared memory instance
shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='psm_test0')
# numpy array backed by shared memory
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)                                    
# copy the original data into shared memory
b[:] = a[:]                                  
print(b)                                            
# array([<__main__.A object at 0x7fac56cd1190>,
#       <__main__.A object at 0x7fac56cd1970>,
#       <__main__.A object at 0x7fac56cd19a0>], dtype=object)
Now, in a different shell, we attach to the shared memory space and try to manipulate the contents of the array.
import numpy as np
from multiprocessing import shared_memory
# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')
c = np.ndarray((3,), dtype=object, buffer=existing_shm.buf)
Even before we are able to manipulate c, merely printing it results in a segmentation fault. This is not surprising: an object-dtype array only stores references (pointers) to Python objects that live in the first interpreter's memory, and those addresses are meaningless in the second process. Indeed, I cannot expect to observe behaviour that has not been written into the module, so my question is: what can I do to work with a shared array of objects?
I'm currently pickling the list, but the protected reads/writes add a fair bit of overhead. I've also tried using a Namespace, which was quite slow because indexed writes are not allowed. Another idea could be to use a shared ctypes Structure in a ShareableList, but I wouldn't know where to start with that.
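For reference, the pickling approach looks roughly like this (a simplified sketch; the block name 'psm_pickled', the fixed 1 MiB size, and the helper names are placeholders, and in practice the lock has to be handed to the worker processes):
import pickle
from multiprocessing import Lock, shared_memory

lock = Lock()  # in practice passed to every worker process
shm = shared_memory.SharedMemory(create=True, size=2**20, name='psm_pickled')

def write_list(objs):
    # serialize the whole list and copy it into the shared block
    data = pickle.dumps(objs)
    with lock:
        shm.buf[:len(data)] = data

def read_list():
    # pickle stops at its STOP opcode, so the oversized buffer is fine
    with lock:
        return pickle.loads(bytes(shm.buf))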
There is also a design aspect: it appears that there is an open bug in shared_memory that may affect my implementation, in which several processes work on different elements of the array.
Is there a more scalable way of sharing a large list of objects between several processes, so that at any given time each running process interacts with a distinct object/element in the list?
UPDATE: At this point, I will also accept partial answers that talk about whether this can be achieved with Python at all.
So, I did a bit of research (Shared Memory Objects in Multiprocessing) and came up with a few ideas:
Serialize the objects, then save them as byte strings to a numpy array. Problematic here is that
one needs to pass the data type from the creator of 'psm_test0' to any consumer of 'psm_test0'. This could be done with another shared memory block, though; see the sketch after the consumer code below.
pickling and unpickling is essentially like a deepcopy, i.e. it actually copies the underlying data.
The code for the 'main' process reads:
import pickle
from multiprocessing import shared_memory
import numpy as np
# a simplistic class example
class A():
    def __init__(self, x):
        self.x = x
    def pickle(self):
        return pickle.dumps(self)
    @classmethod
    def unpickle(cls, bts):
        return pickle.loads(bts)
if __name__ == '__main__':
    # Test pickling procedure
    a = A(1)
    print(A.unpickle(a.pickle()).x)
    # >>> 1
    # numpy array of byte strings
    a_arr = np.array([A(1).pickle(), A(2).pickle(), A('This is a really long test string which should exceed 42 bytes').pickle()])
    # create a shared memory instance
    shm = shared_memory.SharedMemory(
        create=True,
        size=a_arr.nbytes,
        name='psm_test0'
    )
    # numpy array backed by shared memory
    b_arr = np.ndarray(a_arr.shape, dtype=a_arr.dtype, buffer=shm.buf)
    # copy the original data into shared memory
    b_arr[:] = a_arr[:]
    print(b_arr.dtype)
    # 'S105'
and for the consumer
import numpy as np
from multiprocessing import shared_memory
# A is defined in the 'main' script above (assumed to be saved as test.py)
from test import A
# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')
c = np.ndarray((3,), dtype='S105', buffer=existing_shm.buf)
# Test data transfer
arr = [A.unpickle(bts).x for bts in c]
print(arr)
# [1, 2, ...]
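As noted above, the dtype could itself be passed through a second shared memory block instead of hardcoding 'S105' on the consumer side. A minimal sketch, assuming a block named 'psm_test0_dtype' with an arbitrary 64-byte buffer (the name and size are my own choices; the shape could be shared the same way):
import numpy as np
from multiprocessing import shared_memory
# producer side: store the array-protocol typestring, e.g. b'|S105'
dtype_bytes = b_arr.dtype.str.encode()
dt_shm = shared_memory.SharedMemory(create=True, size=64, name='psm_test0_dtype')
dt_shm.buf[:len(dtype_bytes)] = dtype_bytes
# consumer side: read the typestring back and build the array with it
dt_shm = shared_memory.SharedMemory(name='psm_test0_dtype')
dtype_str = bytes(dt_shm.buf).rstrip(b'\x00').decode()
c = np.ndarray((3,), dtype=np.dtype(dtype_str), buffer=existing_shm.buf)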
I'd say you have a few ways of going forward:
Stay with simple data types.
Implement something using the C API, but I can't really help you there.
Use Rust.
Use Managers. You may lose out on some performance (I'd like to see a real benchmark, though), but you get a relatively safe and simple interface for shared objects; see the sketch after this list.
Use Redis, which also has Python bindings...
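For the Manager option, a minimal sketch of the interface (my own example, not benchmarked; note that the list proxy hands out copies of the elements, so changes have to be written back explicitly):
from multiprocessing import Manager, Process

class A:
    def __init__(self, x):
        self.x = x

def worker(shared, i):
    # the list proxy returns a copy, so mutate and write back
    obj = shared[i]
    obj.x += 10
    shared[i] = obj

if __name__ == '__main__':
    with Manager() as mgr:
        shared = mgr.list([A(1), A(2), A(3)])
        procs = [Process(target=worker, args=(shared, i)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print([a.x for a in shared])
        # [11, 12, 13]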