On Wikipedia one can read the following criticism of HDF5:
Criticism of HDF5 follows from its monolithic design and lengthy specification. Though a 150-page open standard, there is only a single C implementation of HDF5, meaning all bindings share its bugs and performance issues. Compounded with the lack of journaling, documented bugs in the current stable release are capable of corrupting entire HDF5 databases. Although 1.10-alpha adds journaling, it is backwards-incompatible with previous versions. HDF5 also does not support UTF-8 well, necessitating ASCII in most places. Furthermore even in the latest draft, array data can never be deleted.
I am wondering whether this applies only to the C implementation of HDF5 or whether it is a general flaw of HDF5.
I am doing scientific experiments which sometimes generate gigabytes of data and in all cases at least several hundred megabytes. Obviously data loss, and especially corruption, would be a huge disadvantage for me.
My scripts always have a Python API, hence I am using h5py (version 2.5.0).
So, is this criticism relevant to me and should I be concerned about corrupted data?
Corruption risks
Corruption may happen if your software crashes while it is accessing an HDF5 file. Once a file is corrupted, all of your data is basically lost forever. This is a major drawback of storing a lot of data in a single file, which is what HDF5 is designed for.
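One way to narrow the window in which a crash can corrupt a file is to keep the file open only while you are actually writing, and to flush before closing. A minimal sketch with h5py (file and dataset names are made up):

```python
import numpy as np
import h5py

def append_run(filename, run_id, data):
    # Keep the file open only for the duration of the write; the context
    # manager closes it even if an exception is raised inside the block.
    with h5py.File(filename, "a") as f:
        f.create_dataset("run_{}".format(run_id), data=data)
        # Push buffered data and metadata to disk before the file is closed.
        f.flush()

append_run("experiment.h5", 1, np.random.rand(1000, 1000))
```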
Declaration up front: I help maintain h5py, so I probably have a bias here.
The wikipedia page has changed since the question was posted, here's what I see:
Criticism
Criticism of HDF5 follows from its monolithic design and lengthy specification.
- Though a 150-page open standard, the only other C implementation of HDF5 is just a HDF5 reader.
- HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.
- Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).
I'd say that pretty much sums up the problems with HDF5: it's complex (but people need this complexity; see the virtual dataset support), it has a long history with backwards compatibility as its focus, and it's not really designed to allow for massive changes in files. It's also not the best on Windows (due to how it deals with filenames).
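To illustrate the last point in that list: deleting a dataset through h5py only unlinks it, the file does not shrink, and you need h5repack to rewrite the file and reclaim the space. A rough sketch (names are made up, and it assumes the h5repack tool from the HDF5 distribution is on your PATH):

```python
import subprocess
import h5py

with h5py.File("results.h5", "a") as f:
    # Unlinks the dataset; the file size stays the same.
    del f["obsolete_dataset"]

# Rewriting the file with h5repack actually reclaims the freed space.
subprocess.run(["h5repack", "results.h5", "results_packed.h5"], check=True)
```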
I picked HDF5 for my research because, of the available options, it had decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), support for multidimensional arrays (which formats like Protocol Buffers don't really support), and support for more than just 64-bit floats (which is very rare).
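For a concrete (if simplified) illustration of those points with h5py, where the file, dataset, and attribute names are made up:

```python
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    # A multidimensional array with a non-64-bit-float dtype (16-bit floats).
    data = np.random.rand(100, 50, 3).astype(np.float16)
    dset = f.create_dataset("measurement", data=data)
    # Metadata stored as attributes; h5py stores Python str values as UTF-8.
    dset.attrs["instrument"] = "Spektrometer β"
    dset.attrs["sample_rate_hz"] = 2000
```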
I can't comment on known bugs, but I have seen corruption (this happened when I was writing to a file and the Linux OOM killer killed my script). However, this shouldn't be a concern as long as you follow proper data hygiene practices (as mentioned in the hackernews link), which in your case would mean not continuously writing to the same file, but creating a new file for each run. You should also not modify the original file; instead, any data reduction should produce new files, and you should always back up the originals.
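A rough sketch of that kind of hygiene with h5py: each run gets its own file (the timestamped naming here is just an example), and data reduction writes to a new file rather than touching the original:

```python
import datetime
import numpy as np
import h5py

def new_run_filename(prefix="run"):
    # One file per run; the timestamp is only an example naming scheme.
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    return "{}_{}.h5".format(prefix, stamp)

raw_name = new_run_filename()
with h5py.File(raw_name, "w") as f:
    f.create_dataset("raw", data=np.random.rand(2048, 2048))

# Data reduction produces a *new* file; the raw file is left untouched
# (and should be backed up separately).
reduced_name = raw_name.replace(".h5", "_reduced.h5")
with h5py.File(raw_name, "r") as raw, h5py.File(reduced_name, "w") as out:
    out.create_dataset("mean_per_row", data=raw["raw"][...].mean(axis=1))
```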
Finally, it is worth pointing out that there are alternatives to HDF5, depending on what exactly your requirements are: SQL databases may fit your needs better (and sqlite comes with Python by default, so it's easy to experiment with), as could a simple CSV file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they're no more robust than HDF5 and are more complex than a CSV file.
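If a relational store does fit your data better, the standard-library sqlite3 module makes it cheap to try; a minimal sketch (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS measurements (run_id INTEGER, t REAL, value REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(1, 0.0, 3.14), (1, 0.1, 2.71)],
)
conn.commit()
conn.close()
```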