'/' in names in HDF5 files confusion

Tags: python, pandas, hdf5, h5py, pytables

I am seeing some really weird interactions between h5py, PyTables (via Pandas), and HDF5 files generated from C++. h5check and h5py seem to cope with member names of a compound type that contain '/', but pandas/PyTables cannot. Clearly, there is a gap in my understanding, so:

What have I not understood here?

The gory details

I have the following data in an HDF5 file:

   [...]
   DATASET "log" {
      DATATYPE  H5T_COMPOUND {
         H5T_COMPOUND {
            H5T_STD_U32LE "sec";
            H5T_STD_U32LE "usec";
         } "time";
         H5T_IEEE_F32LE "CIF/align/aft_port_end/extend_pressure";
         [...]

This was created via the C++ API. The h5check utility says the file is valid.

Note that CIF/align/aft_port_end/extend_pressure is not meant as a path to a group/node/leaf. It is a label that we use internally, which happens to have some internal structure with '/' as the delimiter. We do not want the HDF5 file to know anything about that: it should not care. Clearly, if '/' is illegal in any HDF5 name, then we have to change the delimiter to something else.

Using PyTables (okay, Pandas, but it uses PyTables internally) to read the file, I get a warning:

    >>> import pandas as pd
    >>> store = pd.HDFStore('data/XXX-20150423-071618.h5')
    >>> store
    /home/XXX/virt/env/develop/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf ``/log``::

      the ``/`` character is not allowed in object names: 'XXX/align/aft_port_end/extend_pressure'

    The leaf will become an ``UnImplemented`` node.

I asked about this in another question and was told that '/' is illegal according to the specification. However, things get stranger with h5py...

Using h5py to read the file, I get what I want:

    >>> import h5py
    >>> f = h5py.File('data/XXX-20150423-071618.h5', 'r')
    >>> f['/log'].dtype
    dtype([('time', [('sec', '<u4'), ('usec', '<u4')]), ('CIF/align/aft_port_end/extend_pressure', '<f4')[...]

Which is more or less what I set out with.

Needless to say, I am confused. Have I managed to create an illegal HDF5 file that somehow passes h5check? Or does PyTables simply not support this edge case?

Clearly, I could write a simple wrapper, something like this:

    >>> import matplotlib.pyplot as plt
    >>> silly = pd.DataFrame(f['/log']['CIF/align/aft_port_end/extend_pressure'])
    >>> silly.plot()
    >>> plt.show()

to get all the data from the HDF5 file into Pandas. However, I am not sure if this is a good idea because of the confusion earlier. My biggest worry is the conversion might not scale if the data is very large...

861

asked May 06 '15 08:05

Sardathrion - against SE abuse

2 Answers

I've browsed a bit through the h5check source and I can't find any place where it tests if a name contains a slash. You can examine the error messages it can produce with:

    grep error_push h5checker.c -A1

The links you provided clearly state that slashes are not allowed in object names. So yes, I think you've made a file that is illegal but passes h5check. The tool seems to focus more on the binary data layout. The closest related check I can find is a guard against duplicate names.

In my opinion that's all there is to it. The fact that h5py and other libraries somehow are able to create or read this illegal file is irrelevant. The spec says "don't put slashes in object names", so you don't. End of story.

If you're not convinced, think of it like this: if you somehow managed to create a regular file with a slash in its file name, what would happen? Most programs assume that file names contain no slashes, and thus that they can partition a directory path by splitting it at the slash characters. Your file would break this assumption and so introduce many subtle (and not so subtle) bugs. Users would complain, programmers would hate you, system administrators would curse you.

Likewise, it's safe to assume that, besides PyTables, many other libraries and programs will not be able to handle slashes in variable names. The nice thing about HDF is that so many tools exist for it, and by using slashes you throw away that advantage. You may think that this is not important; perhaps your HDF5 files are for internal use only. However, the situation may change in 5 years, as situations tend to do.

Just bite the bullet and replace '/' with '|' before writing your variables to HDF5. Replace them back when you read them. The time you lose by implementing this, you'll win back x-fold (for x>1) by avoiding future bugs and user complaints.
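To make the replace-on-write, restore-on-read idea concrete, here is a minimal sketch. The helper names `encode_label` and `decode_label` are made up for illustration and are not part of h5py or PyTables; the sketch also assumes that '|' never occurs in your internal labels, and refuses to encode a label where it does.

```python
SLASH = "/"
PLACEHOLDER = "|"  # the replacement suggested above; any character absent from the labels works


def encode_label(label):
    """Turn an internal label into a name that is safe to store in HDF5."""
    if PLACEHOLDER in label:
        # Refuse labels that already contain the placeholder, otherwise
        # decoding could not tell the two characters apart.
        raise ValueError("label already contains {!r}: {!r}".format(PLACEHOLDER, label))
    return label.replace(SLASH, PLACEHOLDER)


def decode_label(name):
    """Recover the internal label from a name read back from HDF5."""
    return name.replace(PLACEHOLDER, SLASH)


label = "CIF/align/aft_port_end/extend_pressure"
stored = encode_label(label)
print(stored)                # CIF|align|aft_port_end|extend_pressure
print(decode_label(stored))  # CIF/align/aft_port_end/extend_pressure
```

The same two calls would go at the boundary of whatever code writes and reads the compound type, so the rest of your code keeps using the '/' labels.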

Sorry about the rant but I hope to have convinced you.

143

answered Sep 20 '22 11:09

titusjan

Could you use h5py to read through all your files and rewrite them without the offending characters, so that PyTables can read them?

If it is outside the spec, I assume what you are experiencing is just that some implementations handle it and others do not...
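A rough sketch of what such a rewrite could look like, under several assumptions: the function names `rename_fields` and `rewrite_dataset` are invented here, the dataset is a packed compound type like the `/log` one in the question, '|' is used as the replacement character, and the whole dataset fits in memory. I have not run this against a file like yours, so treat it as a starting point only.

```python
def rename_fields(descr, bad="/", good="|"):
    """Return a copy of a dtype description with `bad` replaced in every field name.

    `descr` is the nested list-of-tuples form that numpy exposes as
    `dtype.descr`, e.g. [('time', [('sec', '<u4'), ('usec', '<u4')]), ...].
    """
    renamed = []
    for field in descr:
        name, fmt = field[0], field[1]
        if isinstance(fmt, list):  # nested compound type: recurse
            fmt = rename_fields(fmt, bad, good)
        # Keep any trailing shape entry, e.g. ('x', '<f4', (3,)).
        renamed.append((name.replace(bad, good), fmt) + tuple(field[2:]))
    return renamed


def rewrite_dataset(src_path, dst_path, dataset="/log"):
    """Copy one compound dataset to a new file with cleaned field names."""
    # Imported here so rename_fields stays usable without h5py installed.
    import h5py
    import numpy as np

    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        data = src[dataset][...]  # loads the whole dataset into memory
        new_dtype = np.dtype(rename_fields(data.dtype.descr))
        # Same memory layout, only the names differ, so a view is enough.
        dst.create_dataset(dataset, data=data.view(new_dtype))
```

Usage would be something like `rewrite_dataset('old.h5', 'new.h5')` with your own file names. For very large datasets the single in-memory read would have to be replaced by copying in slices.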

answered Sep 23 '22 11:09

tmthydvnprt
