How can I mmap HDF5 data into multiple Python processes?

I am trying to load HDF5 data from a memory cache (memcached) or the network, and then query it (read-only) from multiple Python processes without making a separate copy of the whole data set. Intuitively I would like to mmap the file image (as it would appear on disk) into the multiple processes, and then query it from Python.

I am finding this difficult to achieve, hence the question. Pointers/corrections appreciated.

Ideas I have explored so far

  • pytables - This looks the most promising: it supports a rich interface for querying HDF5 data and (unlike numpy) it seems to work with the data without making a process-local copy. It even supports a method File.get_file_image() which seems to produce the file image. What I don't see is how to construct a new File / FileNode from a memory image rather than a disk file (see the first sketch after this list).
  • h5py - Another way to get at HDF5 data; as with pytables, it seems to require a disk file. The option driver='core' looks promising, but I can't see how to hand it an existing mmap'd region rather than have it allocate its own.
  • numpy - A lower-level approach: if I share my raw data via mmap, I might be able to construct a numpy ndarray that accesses it. But the relevant constructor, ndarray.__new__(buffer=...), says it will copy the data, and numpy views seem to be constructible only from existing ndarrays, not from raw buffers (see the second sketch after this list).
  • ctypes - The lowest-level approach (multiprocessing's Value wrapper could possibly help a little). If I use ctypes directly I can read my mmap'd data without issue, but I lose all the structural information and the help from numpy/pandas/pytables for querying it.
  • Allocate disk space - I could just allocate a file, write out all the data, and then share it via pytables in all my processes. My understanding is that this would be memory efficient, because pytables doesn't copy (until required) and the processes would share the OS disk cache of the underlying file image. My objection is that it is ugly and brings disk I/O into what I would like to be a pure in-memory system.
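
On the pytables idea, PyTables can open a file directly from an in-memory image via HDF5's CORE driver. A minimal sketch, assuming image is a bytes object holding a complete HDF5 file (e.g. fetched from memcached, or produced elsewhere by File.get_file_image()); the node name mytable is hypothetical:

    import tables

    # `image` is assumed to be a bytes object containing a complete
    # HDF5 file image, e.g. pulled from memcached.
    h5 = tables.open_file("any-name.h5", mode="r",
                          driver="H5FD_CORE",
                          driver_core_image=image,
                          driver_core_backing_store=0)  # never touch disk

    tbl = h5.root.mytable                 # hypothetical node name
    rows = tbl.read_where("col > 0")      # query as usual
    h5.close()

One caveat: the CORE driver may copy the image into process-private memory when the file is opened, so this gives in-memory access within each process rather than true page sharing across processes.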
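
On the numpy idea, np.frombuffer creates a zero-copy view over any object exposing the buffer protocol, including an mmap object, so a view over raw bytes is possible without going through an existing ndarray. A minimal sketch, assuming a hypothetical file shared.dat containing raw float64 values:

    import mmap
    import numpy as np

    # Map the file read-only; processes mapping the same file share
    # pages through the OS page cache.
    with open("shared.dat", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Zero-copy, read-only view over the mapped bytes.
    arr = np.frombuffer(mm, dtype=np.float64)
    print(arr[:10])
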
asked Oct 19 '22 by gxmw

1 Answer

I think the situation deserves an update now.

If a disk file is desirable, Numpy now has a standard, dedicated ndarray subclass: numpy.memmap
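
A minimal sketch, assuming a hypothetical file data.dat of raw float64 values (the filename and shape are illustrative):

    import numpy as np

    # Each process maps the same on-disk array read-only; the OS page
    # cache means the pages are physically shared between processes.
    arr = np.memmap("data.dat", dtype=np.float64, mode="r",
                    shape=(1000, 1000))
    print(arr[0, :5])  # pages are faulted in lazily on access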

UPDATE: After looking into the implementation of multiprocessing.sharedctypes (CPython 3.6.2's shared memory block allocation code), I found that it always creates temporary files to be mmapped, so it is not really a file-less solution.

If only pure RAM-based sharing is needed, someone has demonstrated it with multiprocessing.RawArray: test of shared memory array / numpy integration
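
A minimal sketch of that approach (the names here are illustrative): RawArray allocates the doubles in shared memory, and np.frombuffer wraps that buffer without copying, so parent and child see the same data:

    import multiprocessing as mp
    import numpy as np

    def worker(raw):
        # Re-wrap the same shared buffer in the child; nothing is copied.
        view = np.frombuffer(raw, dtype=np.float64)
        print(view[:5])

    if __name__ == "__main__":
        raw = mp.RawArray('d', 1_000_000)           # unsynchronized shared doubles
        arr = np.frombuffer(raw, dtype=np.float64)  # zero-copy numpy view
        arr[:] = np.arange(arr.size)                # parent fills it once

        p = mp.Process(target=worker, args=(raw,))
        p.start()
        p.join()
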

answered Oct 31 '22 by Compl Yue