When opening an HDF5 file with h5py, you can pass in a Python file-like object. I have done so, where the file-like object is a custom implementation of my own network-based transport layer.
This works great: I can slice large HDF5 files over a high-latency transport layer. However, HDF5 appears to provide its own file-locking functionality, so if you open multiple files for read-only access within the same process (threading model), the operations still effectively run in series.
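Schematically, my setup looks like this (a minimal sketch; MyTransportFile is a stand-in for my custom transport class, and the path is illustrative):

    import h5py

    # MyTransportFile is my network-backed file-like object; for read-only
    # access h5py essentially only needs read(), seek() and tell() on it.
    f = MyTransportFile("s3://bucket/huge_file.h5")

    h5f = h5py.File(f, "r", driver="fileobj")    # 'fileobj' is also inferred for file-like objects
    chunk = h5f["/some_dataset"][10_000:20_000]  # slice is fetched over the transport layer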
There are drivers in HDF5 that support parallel operations, such as h5py.File(f, driver='mpio'), but this doesn't appear to apply to Python file-like objects, which use h5py.File(f, driver='fileobj').
The only solution I see is to use multiprocessing. However, its scalability is very limited: you can realistically only open tens of processes because of the overhead. My transport layer uses asyncio and is capable of parallel operations on the scale of thousands or tens of thousands, allowing me to build a longer queue of slow file-read operations and boost my total throughput.
I can achieve 1.5 GB/sec of large-file, random-seek, binary reads with my transport layer against a local S3 interface when I queue 10k I/O ops in parallel (requiring 50 GB of RAM to service the requests, an acceptable trade-off for the throughput).
Is there any way I can disable the h5py file locking when using driver='fileobj'?
Recent versions of NetCDF and HDF5 (HDF5 1.10.x and newer) use a file-locking feature. This prevents data corruption in rare cases of single-writer/multiple-reader and multiple-writer access patterns.
The filelock package contains a single module, which implements a platform-independent file lock in Python and provides a simple way of inter-process communication:

    from filelock import Timeout, FileLock

    lock = FileLock("high_ground.txt.lock")
    with lock:
        with open("high_ground.txt", "a") as f:
            f.write("You were the chosen one.")
The h5py package is a Pythonic interface to the HDF5 binary data format. HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays.
To use HDF5, NumPy needs to be imported. One important feature is that HDF5 can attach metadata to every dataset in the file, which provides powerful searching and access. Let's get started by installing HDF5 on the computer. As HDF5 works on top of NumPy, we need NumPy installed on our machine too.
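A quick illustration of that workflow (a minimal sketch; the file name, dataset name, and attribute are arbitrary):

    import numpy as np
    import h5py

    # Write a dataset and attach a metadata attribute to it.
    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset("measurements", data=np.arange(1_000_000, dtype="f8"))
        dset.attrs["units"] = "volts"

    # Read back only a small slice, without loading the whole array into memory.
    with h5py.File("example.h5", "r") as f:
        part = f["measurements"][250_000:250_010]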
You just need to set the environment variable HDF5_USE_FILE_LOCKING to FALSE.
Examples are as follows:
In Linux or MacOS via Terminal: export HDF5_USE_FILE_LOCKING=FALSE
In Windows via Command Prompts (CMD): set HDF5_USE_FILE_LOCKING=FALSE
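You can also set it from inside Python, as a rough sketch below; the key detail is that the variable has to be set before the HDF5 library first touches a file, so the safest place is before h5py is even imported (the file path here is a placeholder, and any file-like transport object can be used instead of the plain open()):

    import os

    # Must be set before h5py/HDF5 opens any file; safest before importing h5py.
    os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

    import h5py

    f = open("data.h5", "rb")                   # or your own network-backed file-like object
    h5f = h5py.File(f, "r", driver="fileobj")   # HDF5 should now skip its file locking

If your stack is new enough, h5py 3.5+ built against HDF5 1.12.1 or later also exposes a per-file locking keyword, e.g. h5py.File(f, "r", locking=False), but check your installed versions before relying on it.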