Corrupt files when creating HDF5 files without closing them (h5py)

I am using h5py to store experiment data in an HDF5 container.

In an interactive session I open the file using:

measurement_data = h5py.File('example.hdf5', 'a')

Then I write data to the file using some self-written functions (this can be many GB of data from a couple of days of experiments). At the end of the experiment I usually close the file using:

measurement_data.close()

Unfortunately, from time to time the interactive session ends without me explicitly closing the file (an accidentally killed session, a power outage, an OS crash caused by some other software). This always results in a corrupt file and the loss of all data. When I try to open it, I get the error:

OSError: Unable to open file (File signature not found)

I also cannot open the file in HDFView, or in any other software I tried.

  1. Is there a way to avoid a corrupt file even if it is not closed explicitly? I've read about using the with statement, but I'm not sure whether it would help when the session ends unexpectedly.
  2. Can I restore the data in the corrupt files somehow? Is there a repair tool available?

Opening and closing the file for every single write access sounds impractical to me, because I am continuously writing data from many different functions and threads. So I'd prefer a different solution.
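For context, the with-statement pattern mentioned in question 1 looks like the sketch below (file and dataset names are made up for illustration). It guarantees the file is closed when the block exits, even on an exception inside the block, but it cannot help when the process itself is killed or the machine loses power:

```python
import h5py
import numpy as np

# The file is closed automatically when the block exits,
# even if an exception is raised inside it -- but not if
# the interpreter process is killed or the OS crashes.
with h5py.File('example.hdf5', 'a') as measurement_data:
    measurement_data.create_dataset('run1/temperature', data=np.arange(10.0))
```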

asked Jul 08 '15 by erik


1 Answer

The corruption problem is known to the HDF5 developers; they are working on fixing it in version 1.10 by adding journaling. In the meantime, you can call flush() periodically to make sure your writes have reached the file, which should limit the damage to whatever was written after the last flush. You can also use external links, which let you store pieces of data in separate files but link them together into one structure when you read them.
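A minimal sketch of the periodic flush() approach (the file name, dataset name, and resizable-dataset layout are assumptions for illustration, not part of the original question):

```python
import h5py
import numpy as np

# 'w' truncates any previous demo file so the sketch is repeatable;
# in a real session you would open with 'a' as in the question.
f = h5py.File('flush_example.hdf5', 'w')

# A resizable 1-D dataset that grows as measurements arrive.
dset = f.create_dataset('voltage', shape=(0,), maxshape=(None,), dtype='f8')

# Append a chunk of new samples, then flush.
new_samples = np.random.rand(100)
dset.resize(dset.shape[0] + new_samples.size, axis=0)
dset[-new_samples.size:] = new_samples

# flush() pushes h5py/HDF5 buffers out to the file, so a later
# crash only loses data written after this point.
f.flush()

f.close()
```

Calling flush() after each logical chunk of writes (rather than after every sample) keeps the overhead small while bounding how much data a crash can take with it.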

answered Sep 18 '22 by chthonicdaemon