 

HDF5 possible data corruption or loss?

On Wikipedia, one can read the following criticism of HDF5:

Criticism of HDF5 follows from its monolithic design and lengthy specification. Though a 150-page open standard, there is only a single C implementation of HDF5, meaning all bindings share its bugs and performance issues. Compounded with the lack of journaling, documented bugs in the current stable release are capable of corrupting entire HDF5 databases. Although 1.10-alpha adds journaling, it is backwards-incompatible with previous versions. HDF5 also does not support UTF-8 well, necessitating ASCII in most places. Furthermore even in the latest draft, array data can never be deleted.

I am wondering whether this applies only to the C implementation of HDF5, or whether it is a general flaw of the format itself?

I am doing scientific experiments which sometimes generate gigabytes of data, and in all cases at least several hundred megabytes. Obviously, data loss and especially corruption would be a huge problem for me.

My scripts always use a Python API, hence I am using h5py (version 2.5.0).

So, is this criticism relevant to me and should I be concerned about corrupted data?

asked Mar 07 '16 by daniel451


1 Answer

Declaration up front: I help maintain h5py, so I probably have some bias.

The Wikipedia page has changed since the question was posted; here's what I see now:

Criticism

Criticism of HDF5 follows from its monolithic design and lengthy specification.

  • Though a 150-page open standard, the only other C implementation of HDF5 is just a HDF5 reader.
  • HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.
  • Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).

I'd say that pretty much sums up the problems with HDF5: it's complex (but people need this complexity; see the virtual dataset support), it has a long history with backwards compatibility as its focus, and it's not really designed to allow for massive changes within a file. It's also not the best on Windows (due to how it deals with filenames).
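To make the last bullet point concrete: deleting a dataset through h5py only unlinks it, and the space it occupied stays in the file until you repack. A minimal sketch (filenames here are illustrative):

    import os
    import h5py
    import numpy as np

    # Write a large dataset, then delete it again.
    with h5py.File("example.h5", "w") as f:
        f.create_dataset("big", data=np.zeros((1000, 1000)))

    size_before = os.path.getsize("example.h5")

    with h5py.File("example.h5", "a") as f:
        del f["big"]  # unlinks the dataset; the bytes are not reclaimed

    size_after = os.path.getsize("example.h5")
    print(size_before, size_after)  # roughly the same size

    # Reclaiming the space requires an external copy, e.g.:
    #   h5repack example.h5 repacked.h5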

I picked HDF5 for my research because, of the available options, it had decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), support for multidimensional arrays (which formats like Protocol Buffers don't really support), and support for more than just 64-bit floats (which is very rare).

I can't comment on known bugs, but I have seen corruption (it happened while I was writing to a file and Linux OOM-killed my script). However, this shouldn't be a concern as long as you follow proper data hygiene practices (as mentioned in the hackernews link): in your case, don't continuously write to the same file; instead, create a new file for each run. You should also never modify a file once written: any data reduction should produce new files, and you should always keep backups of the originals.
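As a rough sketch of what "one file per run" might look like with h5py (the filename pattern and dataset name are my own invention, not a fixed convention):

    import time
    import h5py
    import numpy as np

    # Stand-in for one experimental run's results.
    results = np.random.rand(1000, 3)

    # One file per run: if the process crashes or is OOM-killed mid-write,
    # only this run's file is at risk; earlier runs stay untouched.
    run_id = time.strftime("%Y%m%d-%H%M%S")
    with h5py.File("run-{}.h5".format(run_id), "w") as f:
        f.create_dataset("measurements", data=results)
        f.flush()  # push buffered data down to the OS before closing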

Finally, it is worth pointing out that there are alternatives to HDF5, depending on what exactly your requirements are: SQL databases may fit your needs better (and sqlite comes with Python by default, so it's easy to experiment with), as could a simple CSV file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they're no more robust than HDF5 while being more complex than a CSV file.
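For instance, a quick sqlite experiment needs nothing beyond the standard library (the schema here is made up purely for illustration); sqlite also journals its writes, which speaks directly to the corruption worry above:

    import sqlite3

    con = sqlite3.connect("runs.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS measurements (run_id TEXT, x REAL, y REAL)"
    )
    with con:  # the connection as a context manager wraps this in a transaction
        con.executemany(
            "INSERT INTO measurements VALUES (?, ?, ?)",
            [("run-1", 0.1, 0.2), ("run-1", 0.3, 0.4)],
        )
    con.close()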

answered Sep 20 '22 by James Tocknell