I am using h5py to store intermediate data from numerical work in an HDF5 file. I have the project under version control, but this doesn't work well with the HDF5 files: every time a script that generates an HDF5 file is re-run, the binary file changes even if the data within does not.
Here is a small example to illustrate this:
In [1]: import h5py, numpy as np
In [2]: A = np.arange(5)
In [3]: f = h5py.File('test.h5', 'w'); f['A'] = A; f.close()
In [4]: !md5sum test.h5
7d27c258d94ed5d06736f6d2ba7c9433 test.h5
In [5]: f = h5py.File('test.h5', 'w'); f['A'] = A; f.close()
In [6]: !md5sum test.h5
c1db5806f1393f2095c88dbb7efeb7d3 test.h5
In [7]: # the file has changed but still contains the same data!
I have looked in the HDF5 file format documents and at the h5py documentation but haven't found anything which helps me with this. My questions are:
Why does the file change even though I'm saving the same data?
How can I stop it changing, so version control only sees a new version of the file when the actual numerical content has changed?
Thanks
HDF5 uses both an abstract data model and an abstract storage model. This means that how a file is stored on disk may be (and usually is) completely different from how it is represented in your program. It's possible to store exactly the same data in more than one way without this being apparent to your program.
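As a rough illustration of that last point, the sketch below writes the same array with two storage layouts (contiguous and chunked) and then compares both the data read back and the raw file bytes. The file names and the `md5_of` helper are made up for this example, and `track_times=False` is passed only so that timestamps don't affect the comparison.

```python
import hashlib
import os
import tempfile

import h5py
import numpy as np

A = np.arange(5)

# Hypothetical file names, placed in a throwaway temp directory.
tmpdir = tempfile.mkdtemp()
contiguous_path = os.path.join(tmpdir, "contiguous.h5")
chunked_path = os.path.join(tmpdir, "chunked.h5")

# Same array, two storage layouts. track_times=False keeps dataset
# timestamps out of the comparison.
with h5py.File(contiguous_path, "w") as f:
    f.create_dataset("A", data=A, track_times=False)
with h5py.File(chunked_path, "w") as f:
    f.create_dataset("A", data=A, chunks=(5,), track_times=False)


def md5_of(path):
    """Return the MD5 hex digest of a file's raw bytes."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


# Read the data back through the high-level API.
with h5py.File(contiguous_path, "r") as f:
    from_contiguous = f["A"][...]
with h5py.File(chunked_path, "r") as f:
    from_chunked = f["A"][...]

same_data = np.array_equal(from_contiguous, from_chunked)
same_bytes = md5_of(contiguous_path) == md5_of(chunked_path)
print("same data:", same_data)
print("same bytes on disk:", same_bytes)
```

The program sees identical arrays in both cases, while the bytes on disk are expected to differ because the layout information stored in the file is different.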
The HDF5 file format specification allows for several timestamps in the data object headers. These are not stored as attributes, and so aren't usually accessible through the high-level APIs. It's possible to turn off writing these timestamps using the low-level HDF5 APIs, but it's not clear if the relevant features are in h5py. This GitHub issue appears to be exactly what you want, but unfortunately it is still open.
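For what it's worth, h5py's `create_dataset` accepts a `track_times` keyword, and passing `track_times=False` asks HDF5 not to record timestamps for that dataset. Here is a minimal sketch, assuming a reasonably recent h5py; the `write_and_hash` helper and the temporary file path are invented for the example. Note that the `f['A'] = A` shortcut uses the default settings, and whether timestamps are tracked by default may depend on your h5py version, so it seems safer to be explicit.

```python
import hashlib
import os
import tempfile

import h5py
import numpy as np


def write_and_hash(path, data, track_times):
    """Write `data` as dataset 'A' in a fresh file and return the file's MD5."""
    with h5py.File(path, "w") as f:
        # track_times=False asks HDF5 not to store dataset timestamps.
        f.create_dataset("A", data=data, track_times=track_times)
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


A = np.arange(5)

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.h5")

first = write_and_hash(path, A, track_times=False)
second = write_and_hash(path, A, track_times=False)
print("checksums match:", first == second)
```

With timestamps disabled, re-running the write with unchanged data should leave the checksum unchanged, so version control would only see a new file when the contents really differ. This only covers datasets created this way; other objects or library settings could still introduce differences, so it's worth checking against your own files.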