I've been working with HDF5 files in C and MATLAB, both using the same approach for reading from and writing to datasets: h5f, h5d, h5s, and so on.

Now I'm working in Python, and with its h5py library I see that it offers two ways to manage HDF5: a high-level and a low-level interface. With the former, it takes fewer lines of code to get the information from a single variable in the file.

Is there any noticeable loss of performance when using the high-level interface? For example, when dealing with a file that contains many variables, and we need to read just one of them.
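To illustrate what I mean, here is roughly how reading one dataset looks in each interface. This is just a sketch; the file name data.h5 and the dataset name variable are made up:

```python
import h5py
import numpy as np

# High-level: read one whole dataset (names are hypothetical).
with h5py.File("data.h5", "r") as f:
    arr = f["variable"][()]

# Low-level: the same read via the h5f/h5d/h5s-style API.
fid = h5py.h5f.open(b"data.h5", h5py.h5f.ACC_RDONLY)
did = h5py.h5d.open(fid, b"variable")
shape = did.get_space().get_simple_extent_dims()
out = np.empty(shape, dtype=did.dtype)
did.read(h5py.h5s.ALL, h5py.h5s.ALL, out)
fid.close()  # remaining low-level IDs are released when they go out of scope
```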
Performance (and file size) can also depend on your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated by per-chunk overhead. Try to find a balance between chunk sizes that serve your access pattern and the overhead (size-wise) that they introduce in the HDF5 file.
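Chunking is chosen when a dataset is created. A minimal sketch, where the shapes and the gzip compression choice are arbitrary placeholders, not a recommendation:

```python
import h5py
import numpy as np

data = np.zeros((10_000, 1_000), dtype=np.float32)

with h5py.File("chunked.h5", "w") as f:
    # Chunks of 1000 full rows each: reasonable for row-block reads,
    # while keeping the number of chunks (and their overhead) modest.
    f.create_dataset("big", data=data,
                     chunks=(1_000, 1_000),
                     compression="gzip")
```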
An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups. The most fundamental thing to remember when using h5py is: groups work like dictionaries, and datasets work like NumPy arrays.
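A small, self-contained example of that dictionary/array behavior; the group and dataset names here are invented for illustration:

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("measurements")               # groups nest like folders
    grp.create_dataset("temperature", data=np.arange(100.0))

with h5py.File("example.h5", "r") as f:
    print(list(f.keys()))                 # ['measurements'] -- dictionary-like
    dset = f["measurements/temperature"]  # path-style lookup
    print(dset.shape, dset.dtype)         # (100,) float64 -- NumPy-like
    print(dset[:5])                       # [0. 1. 2. 3. 4.]
```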
Supports Large, Complex Data: HDF5 supports compression and is designed to handle large, heterogeneous, and complex datasets. Supports Data Slicing: "data slicing", or extracting portions of a dataset as needed for analysis, means large files don't need to be read completely into the computer's memory (RAM).
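Slicing a dataset in h5py reads only the requested region from disk. A brief sketch, reusing the hypothetical file from the example above:

```python
import h5py

with h5py.File("example.h5", "r") as f:
    dset = f["measurements/temperature"]
    window = dset[10:20]  # only these 10 values are read into RAM,
                          # not the whole dataset
print(window)
```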
Encodings: HDF5 supports two string encodings, ASCII and UTF-8.
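In h5py, the encoding is selected through h5py.string_dtype. A short sketch; the file and dataset names are made up:

```python
import h5py

with h5py.File("strings.h5", "w") as f:
    # UTF-8 variable-length strings (suits Python str data).
    f.create_dataset("names", data=["héllo", "wörld"],
                     dtype=h5py.string_dtype(encoding="utf-8"))
    # ASCII strings, written as bytes.
    f.create_dataset("codes", data=[b"A1", b"B2"],
                     dtype=h5py.string_dtype(encoding="ascii"))
```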
High-level interfaces generally come with some performance cost. Whether that cost is noticeable (and worth investigating) depends on what exactly you are doing in your code.
Just start with the high-level interface. If the code is too slow overall, profile it, move the bottlenecks down to the low-level interface, and see if that helps.
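For the profiling step, something as simple as timeit can tell you whether the interface overhead matters for your access pattern. A sketch, where data.h5 and variable are again assumed names:

```python
import timeit
import h5py

def read_one_variable():
    # High-level read of a single dataset from a many-variable file.
    with h5py.File("data.h5", "r") as f:
        return f["variable"][()]

# If this is fast enough for your workload, there is no need
# to drop down to the h5f/h5d/h5s layer.
print(timeit.timeit(read_one_variable, number=100))
```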