I'm trying to understand the limits of HDF5 concurrency.
There are two builds of HDF5: parallel HDF5 and default. The parallel version is currently supplied in Ubuntu, and the default in Anaconda (judging by the --enable-parallel flag).
I know that parallel writes to the same file are impossible. However, I don't fully understand to what extent the following actions are possible with the default or the parallel build:
Also, is there any reason Anaconda does not turn the --enable-parallel flag on by default? (https://github.com/conda/conda-recipes/blob/master/hdf5/build.sh)
The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. HDF5 uses a "file directory"-like structure that lets you organize data within the file in many different ways, as you might do with files on your computer.
This is probably due to your chunk layout: the smaller the chunks, the more bloated your HDF5 file will be. Try to find a balance between a chunk size that suits your use case and the size overhead that the chunks introduce in the HDF5 file.
Parallel HDF5 (PHDF5) is the parallel version of the HDF5 library. It utilizes MPI to perform parallel HDF5 operations. For example, when a file is opened with an MPI communicator, all the processes within the communicator can perform various operations on the file.
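AFAICT, there are three ways to build libhdf5: with neither thread safety nor MPI support (as in the conda recipe you posted), with MPI support but no thread safety (--enable-parallel), or with thread safety but no MPI support (--enable-threadsafe). That is, the --enable-threadsafe and --enable-parallel flags are mutually exclusive (https://www.hdfgroup.org/hdf5-quest.html#p5thread).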
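If you are working from Python, one way to see which build you ended up with is to ask h5py. This is only a sketch, assuming h5py is the binding you use; describe_hdf5_build is a made-up helper name. As far as I know h5py reports whether it was built with MPI support but has no matching flag for thread safety, so that part still has to come from the build recipe.

```python
def describe_hdf5_build():
    """Report what h5py knows about the HDF5 library it is linked against.

    Hypothetical helper for illustration; returns a plain dict.
    """
    try:
        import h5py
    except ImportError:
        # h5py is not installed, so nothing can be reported.
        return {"h5py_available": False}

    return {
        "h5py_available": True,
        # Version of the underlying libhdf5.
        "hdf5_version": h5py.version.hdf5_version,
        # True only when h5py was built against a parallel (MPI) libhdf5.
        "mpi": bool(h5py.get_config().mpi),
    }


if __name__ == "__main__":
    print(describe_hdf5_build())
```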
As for concurrent reads on one or even multiple files, the answer is that you need thread safety (https://www.hdfgroup.org/hdf5-quest.html#tsafe):
Concurrent access to one or more HDF5 file(s) from multiple threads in the same process will not work with a non-thread-safe build of the HDF5 library. The pre-built binaries that are available for download are not thread-safe.
Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists.
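To make the "semaphore around the library API calls" idea concrete, here is a toy sketch in plain Python. It is not HDF5 code: FakeLibrary, open and worker are invented names, and the shared list only stands in for something like the open file list. The point is that every call takes the same lock, even when the threads touch unrelated files.

```python
import threading


class FakeLibrary:
    """Hypothetical stand-in for a library with process-wide state."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the global API lock
        self.open_files = []           # shared by every caller
        self.next_id = 0

    def open(self, name):
        # Every public call takes the same lock, so calls never interleave,
        # even when they concern unrelated files.
        with self._lock:
            file_id = self.next_id
            self.next_id = file_id + 1
            self.open_files.append((file_id, name))
            return file_id


def worker(lib, prefix, count):
    # Each thread "opens" its own, distinct set of files.
    for i in range(count):
        lib.open(f"{prefix}-{i}.h5")


lib = FakeLibrary()
threads = [
    threading.Thread(target=worker, args=(lib, f"t{n}", 500)) for n in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

ids = [file_id for file_id, _ in lib.open_files]
print(len(ids), len(set(ids)))
```

Because the read-and-increment of next_id happens under the lock, every id handed out is unique. The price is the one the quoted passage implies: calls are serialised, so threads do not actually run library code at the same time.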
Edit: The links above no longer work because the HDF Group reorganised their website. There is a page, "Questions about thread-safety and concurrent access", in the HDF5 Knowledge Base that contains some useful information.
While only concurrent threads on a single process are mentioned in the passage, it appears to apply equally to forked subprocesses: see this h5py multiprocessing example.
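For what it's worth, the pattern I would try with subprocesses is to have each worker open the file itself rather than inherit an open handle from the parent. The following is a rough sketch under that assumption, not a copy of the h5py example; the file name, the dataset names and the helpers make_example_file and read_sum are all invented for illustration.

```python
import multiprocessing as mp

import h5py


def make_example_file(path):
    """Write two small datasets so the workers have something to read."""
    with h5py.File(path, "w") as f:
        f.create_dataset("a", data=list(range(10)))
        f.create_dataset("b", data=[1.0] * 5)


def read_sum(task):
    """Open the file inside the worker, read one dataset, return its sum."""
    path, dataset = task
    # The handle is created and closed in this process only; nothing
    # HDF5-related is shared with the parent or with sibling workers.
    with h5py.File(path, "r") as f:
        return float(f[dataset][...].sum())


if __name__ == "__main__":
    path = "example.h5"
    make_example_file(path)
    with mp.Pool(processes=2) as pool:
        print(pool.map(read_sum, [(path, "a"), (path, "b")]))
```

The parent closes the file before the pool is created, so no open HDF5 state is carried across the fork.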
Now, for parallel access, you might want to use "Parallel HDF5", but those features require using MPI. This pattern is supported by h5py but is more complicated and esoteric, and probably even less portable than thread-safe mode. More importantly, trying to naively do concurrent reads with a parallel build of libhdf5 will lead to unexpected results because the library isn't thread-safe.
Besides efficiency, one limitation of the thread-safe build flag is lack of Windows support (https://www.hdfgroup.org/hdf5-quest.html#gconc):
The thread-safe version of HDF5 is currently not tested or supported on MS Windows platforms. A user was able to get this working on Windows 64-bit and contributed his Windows 64-bit Pthreads patches.
Getting weird corrupt results when reading (different!) files from Python is definitely unexpected and frustrating given how concurrent read access is one of the touted "features" of HDF5. Perhaps a better default recipe for conda would be to include --enable-threadsafe on those platforms that support it, but I guess then you would end up with platform-specific behavior. Maybe there ought to be separate packages for the three build modes instead?