 

Lock-free read-only list in Python?

I've done some basic performance and memory consumption benchmarks and I was wondering if there is any way to make things even faster...

  1. I have a giant list of 70,000 elements, each a tuple containing a NumPy ndarray and the corresponding file path.

  2. My first version passed a sliced-up copy of the list to each of the processes via the Python multiprocessing module, but RAM usage exploded to over 20 GB.

  3. In the second version I moved the list into global scope and accessed it by index (foo[i]) in a loop in each process (sketched below). That seems to give the child processes shared-memory/copy-on-write semantics, so memory usage does not explode (it stays at ~3 GB).

  4. However, according to the performance benchmarks/tracing, the large majority of the application's time now seems to be spent in "acquire"...

So I was wondering: is there any way I can turn this list into some sort of lock-free/read-only structure, so that I can do away with the acquire step and speed up access even more?
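For reference, here's a stripped-down sketch of the second version mentioned in point 3 (simplified; process_image is a stand-in for the real per-image work, and the chunking is just illustrative):

import multiprocessing

sim = []  # populated once in the parent before forking (see Edit 2)

def process_image(np_array, path):
    pass  # stand-in for the real per-image work

def worker(indices):
    # The children only ever read sim[i]; on Linux, fork() gives them
    # copy-on-write access to the parent's memory, so the list is not
    # copied up front.
    for i in indices:
        np_array, path = sim[i]
        process_image(np_array, path)

if __name__ == "__main__":
    nprocs = 8
    chunks = [range(i, len(sim), nprocs) for i in range(nprocs)]
    procs = [multiprocessing.Process(target=worker, args=(c,)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()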

Edit 1: Here are the top few lines of the app's profiling output:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   65 2450.903   37.706 2450.903   37.706 {built-in method acquire}
39320    0.481    0.000    0.481    0.000 {method 'read' of 'file' objects}
  600    0.298    0.000    0.298    0.000 {posix.waitpid}
   48    0.271    0.006    0.271    0.006 {posix.fork}

Edit 2: Here's an example of the list structure:

# Sample code for a rough idea of how the list is constructed
import os
import numpy as np
from PIL import Image

sim = []
for root, dirs, files in os.walk(rootdir):
    for filename in files:
        path = os.path.join(root, filename)
        image = Image.open(path)
        np_array = np.asarray(image)
        sim.append((np_array, path))

# Roughly, each element would look something like this
sim = [(np.array([[1, 2, 3], [4, 5, 6]], np.int32), "/foobar/com/what.something")]

From then on, the sim list is meant to be read-only.

asked Jan 20 '11 by Pharaun


1 Answer

The multiprocessing module provides exactly what you need: a shared array with optional locking, namely the multiprocessing.Array class. Pass lock=False to the constructor to disable locking.
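For example (a minimal sketch; the typecode "i" and the contents are just for illustration):

import multiprocessing

# lock=False returns a plain shared ctypes array without a synchronizing
# lock, which is safe as long as the array is only read once it is filled.
arr = multiprocessing.Array("i", range(10), lock=False)
print(arr[3])  # ordinary indexing, no lock acquisition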

Edit (taking your update into account): Things are actually considerably more involved than I initially expected. The data of all the elements in your list needs to be created in shared memory. Whether you put the list itself (i.e. the pointers to the actual data) in shared memory does not matter too much, because it should be small compared to the data of all the files. To store the file data in shared memory, use

shared_data = multiprocessing.sharedctypes.RawArray("c", data)

where data is the data you read from the file. To use this as a NumPy array in one of the processes, use

numpy.frombuffer(shared_data, dtype="c")

which will create a NumPy array view for the shared data. Similarly, to put the path name into shared memory, use

shared_path = multiprocessing.sharedctypes.RawArray("c", path)

where path is an ordinary Python string. In your processes, you can access this as a Python string by using shared_path.raw. Now append (shared_data, shared_path) to your list. The list will get copied to the other processes, but the actual data won't.
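Putting the pieces together, roughly (an untested sketch; paths and the open(path, "rb") read are stand-ins for however you actually enumerate and load the files):

import multiprocessing.sharedctypes
import numpy as np

sim = []
for path in paths:  # stand-in: however you enumerate your files
    with open(path, "rb") as f:
        data = f.read()
    # Both the file data and the path live in shared memory; the list
    # itself only holds pairs of small pointers, so copying it is cheap.
    shared_data = multiprocessing.sharedctypes.RawArray("c", data)
    shared_path = multiprocessing.sharedctypes.RawArray("c", path)
    sim.append((shared_data, shared_path))

def worker(i):
    # In a child process: a zero-copy NumPy view plus the raw path string.
    np_array = np.frombuffer(sim[i][0], dtype="c")
    path = sim[i][1].raw
    return np_array, path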

I initially meant to use a multiprocessing.Array for the list itself. That would be perfectly possible and would ensure that the list (i.e. the pointers to the data) is also in shared memory. Now I think this is not that important, as long as the actual data is shared.

answered Nov 17 '22 by Sven Marnach