'/' in names in HDF5 files confusion

I am experiencing some really weird interactions between h5py, PyTables (via Pandas), and C++-generated HDF5 files. h5check and h5py seem to cope with type names containing '/', but pandas/PyTables cannot. Clearly, there is a gap in my understanding, so:

What have I not understood here?


The gory details

I have the following data in an HDF5 file:

   [...]
   DATASET "log" {
      DATATYPE  H5T_COMPOUND {
         H5T_COMPOUND {
            H5T_STD_U32LE "sec";
            H5T_STD_U32LE "usec";
         } "time";
         H5T_IEEE_F32LE "CIF/align/aft_port_end/extend_pressure";
         [...]

This was created via the C++ API. The h5check utility says the file is valid.

Note that CIF/align/aft_port_end/extend_pressure is not meant as a path to a group/node/leaf. It is a label that we use internally, which happens to have some internal structure with '/' as the delimiter. We do not want the HDF5 file to know anything about that: it should not care. Clearly, if '/' is illegal in any HDF5 name, then we have to change that delimiter to something else.

Using PyTables (okay, Pandas, but it uses PyTables internally) to read the file, I get a warning:

 >>> import pandas as pd
 >>> store = pd.HDFStore('data/XXX-20150423-071618.h5')
 >>> store
/home/XXX/virt/env/develop/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf ``/log``::

  the ``/`` character is not allowed in object names: 'XXX/align/aft_port_end/extend_pressure'

The leaf will become an ``UnImplemented`` node. 

I asked about this in another question and was told that '/' is illegal according to the specification. However, things get stranger with h5py...

Using h5py to read the file, I get what I want:

>>> import h5py
>>> f = h5py.File('data/XXX-20150423-071618.h5', 'r')
>>> f['/log'].dtype
dtype([('time', [('sec', '<u4'), ('usec', '<u4')]), ('CIF/align/aft_port_end/extend_pressure', '<f4')[...]

Which is more or less what I set out with.

Needless to say, I am confused. Have I managed to create an illegal HDF5 file that somehow passes h5check? Or does PyTables simply not support this edge case?


Clearly, I could write a simple wrapper, something like this:

>>> import matplotlib.pyplot as plt
>>> silly = pd.DataFrame(f['/log']['CIF/align/aft_port_end/extend_pressure'])
>>> silly.plot()
>>> plt.show()

to get all the data from the HDF5 file into Pandas. However, I am not sure if this is a good idea because of the confusion earlier. My biggest worry is the conversion might not scale if the data is very large...

Sardathrion - against SE abuse asked May 06 '15 08:05


2 Answers

I've browsed a bit through the h5check source and I can't find any place where it tests if a name contains a slash. You can examine the error messages it can produce with:

grep error_push h5checker.c -A1

The links you provided clearly state that slashes are not allowed in object names. So yes, I think you've made a file that is illegal but passes h5check. The tool seems to focus more on the binary data layout. The closest related check I can find is a guard against duplicate names.

In my opinion that's all there is to it. The fact that h5py and other libraries somehow are able to create or read this illegal file is irrelevant. The spec says "don't put slashes in object names", so you don't. End of story.

If you're not convinced, think of it like this: if you somehow managed to create a regular file with a slash in its file name, what would happen? Most programs assume that file names contain no slashes and thus that they can partition a directory path by splitting it at the slash characters. Your file would break this behavior and so introduce many subtle (and not so subtle) bugs. Users would complain, programmers would hate you, system administrators would curse you.

Likewise, it's safe to assume that, besides PyTables, many other libraries and programs will not be able to handle slashes in variable names. The nice thing about HDF is that so many tools exist for it, and by using slashes you throw away that advantage. You may think that this is not important, since perhaps your HDF5 files are for internal use only. However, the situation may change in 5 years, as situations tend to do.

Just bite the bullet and replace '/' with '|' before writing your variables to HDF5. Replace them back when you read them. The time you lose by implementing this, you'll win back x-fold (for x>1) by avoiding future bugs and user complaints.
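As a rough illustration of that replace-and-restore step, here is a minimal Python sketch. The helper names `encode_label` and `decode_label` are made up for this example, and it assumes that '|' never appears in your original labels (the encoder refuses labels where it does, because the round trip would no longer be unambiguous).

```python
SEPARATOR = "/"     # delimiter used in the internal labels
REPLACEMENT = "|"   # stand-in character stored in the HDF5 file


def encode_label(label):
    """Return a slash-free version of an internal label for use as an HDF5 name."""
    if REPLACEMENT in label:
        # Decoding could not tell an original '|' from an encoded '/'.
        raise ValueError("label already contains %r: %r" % (REPLACEMENT, label))
    return label.replace(SEPARATOR, REPLACEMENT)


def decode_label(name):
    """Turn a stored HDF5 name back into the internal label."""
    return name.replace(REPLACEMENT, SEPARATOR)


label = "CIF/align/aft_port_end/extend_pressure"
stored = encode_label(label)
print(stored)                 # CIF|align|aft_port_end|extend_pressure
print(decode_label(stored))   # CIF/align/aft_port_end/extend_pressure
```

Keeping both directions in one small module means the writer and every reader agree on the mapping, whichever language ends up doing the writing.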

Sorry about the rant but I hope to have convinced you.

titusjan answered Sep 20 '22 11:09


Could you use h5py to read through all your files and rewrite them without the offending characters, so that PyTables can read them?

If it is outside the spec, I assume what you are experiencing is just that some implementations handle it and others do not...
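A rough sketch of such a rewrite pass, under several assumptions: the dataset is the compound `/log` from the question, only top-level field names contain '/', the fields are plain numeric types, and the dataset fits in memory. The function names, the '|' replacement, and the file paths in the usage note are illustrative only. Attributes and other objects in the file are not copied.

```python
def build_rename_map(names, old="/", new="|"):
    """Map each field name containing `old` to a replacement without it."""
    mapping = {n: n.replace(old, new) for n in names if old in n}
    # Refuse to continue if a sanitized name would clash with another field.
    untouched = set(names) - set(mapping)
    targets = list(mapping.values())
    if len(set(targets)) != len(targets) or untouched & set(targets):
        raise ValueError("sanitized field names would collide")
    return mapping


def rewrite_dataset(src_path, dst_path, dataset="/log"):
    """Copy one compound dataset to a new file with sanitized field names."""
    # Imported here so build_rename_map stays usable without h5py/NumPy.
    import h5py
    import numpy as np

    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        data = src[dataset][...]  # reads the whole dataset into memory
        old = data.dtype
        mapping = build_rename_map(old.names)
        # Same formats, offsets and item size; only top-level names change.
        new = np.dtype({
            "names": [mapping.get(n, n) for n in old.names],
            "formats": [old.fields[n][0] for n in old.names],
            "offsets": [old.fields[n][1] for n in old.names],
            "itemsize": old.itemsize,
        })
        dst.create_dataset(dataset, data=data.view(new))
```

It would be called as, for example, `rewrite_dataset('old.h5', 'new.h5')` with hypothetical file names. For data too large for memory, the same renamed dtype could be used to create an empty dataset and copy it slice by slice.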

tmthydvnprt answered Sep 23 '22 11:09