
How to write a dataset of Null Terminated Fixed Length Strings with h5py

Tags: python, h5py

I have an example in C++ that I'm trying to reproduce using h5py, but it is not working as expected. I'm getting null padded strings with h5py where I expect null terminated strings.

Here is my C++ driver...

main.cpp

#include <hdf5.h>

int main(void) {
    auto file = H5Fcreate("test-c.h5", H5F_ACC_TRUNC,
            H5P_DEFAULT, H5P_DEFAULT);
    char strings[5][64] = {
        "please work 0",
        "please work 1",
        "please work 2",
        "please work 3",
        "please work 4"};
    auto H5T_C_S1_64 = H5Tcopy (H5T_C_S1);
    H5Tset_size(H5T_C_S1_64, 64);
    hsize_t dims[1] = {5};
    auto dataspace = H5Screate_simple(1, dims, NULL);
    auto dataset = H5Dcreate(file, "test dataset", H5T_C_S1_64, dataspace,
            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite (dataset, H5T_C_S1_64, H5S_ALL, H5S_ALL, H5P_DEFAULT, strings);
    H5Dclose(dataset);
    H5Sclose(dataspace);
    H5Tclose(H5T_C_S1_64);
    H5Fclose(file);
    return 0;
}

Which I build with the following SCons script.

SConstruct

env = Environment()
env.Append(LIBS=['hdf5'],
           CPPFLAGS=['-std=c++11'])
env.Program('writeh5', 'main.cpp')

And here is the Python script with which I'm trying to write out the same HDF5 file.

main.py

import h5py

hdf5 = h5py.File('test-p.h5', 'w')
H5T_C_S1_64 = h5py.h5t.C_S1.copy()
H5T_C_S1_64.set_size(64)
print "Null Terminated String: %s" % (
    H5T_C_S1_64.get_strpad() == h5py.h5t.STR_NULLTERM)
dataset = hdf5.create_dataset('test dataset', (5,),
                              data=['please work %s' % n for n in xrange(5)],
                              dtype=H5T_C_S1_64)
hdf5.close()

I'm using Python v2.7.11, and I have tried this with both h5py v2.5.0 and v2.6.0, with the same results, shown below.

>> python --version
Python 2.7.11

>> python -c "import h5py; print h5py.version.version"
2.5.0

>> tree
.
├── main.cpp
├── main.py
└── SConstruct

0 directories, 3 files

>> scons
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
g++ -o main.o -c -std=c++11 main.cpp
g++ -o writeh5 main.o -lhdf5
scons: done building targets.

>> tree
.
├── main.cpp
├── main.o
├── main.py
├── SConstruct
└── writeh5

0 directories, 5 files

>> ./writeh5 

>> tree
.
├── main.cpp
├── main.o
├── main.py
├── SConstruct
├── test-c.h5
└── writeh5

0 directories, 6 files

>> python main.py
Null Terminated String: True

>> tree
.
├── main.cpp
├── main.o
├── main.py
├── SConstruct
├── test-c.h5
├── test-p.h5
└── writeh5

0 directories, 7 files

>> h5dump test-c.h5 
HDF5 "test-c.h5" {
GROUP "/" {
   DATASET "test dataset" {
      DATATYPE  H5T_STRING {
         STRSIZE 64;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      DATA {
      (0): "please work 0", "please work 1", "please work 2",
      (3): "please work 3", "please work 4"
      }
   }
}
}

>> h5dump test-p.h5
HDF5 "test-p.h5" {
GROUP "/" {
   DATASET "test dataset" {
      DATATYPE  H5T_STRING {
         STRSIZE 64;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      DATA {
      (0): "please work 0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
      (1): "please work 1\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
      (2): "please work 2\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
      (3): "please work 3\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
      (4): "please work 4\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
      }
   }
}
}

As you can see from the output above, I am still ending up with null padded fixed length strings when using h5py, even though I am specifying I want null terminated fixed length strings.

So how do I modify my python script to end up with null terminated fixed length strings in the dataset? If it is a bug in h5py, are there any workarounds?

Thanks in advance for any help.

Kenneth E. Bellock asked Oct 30 '22 00:10


1 Answer

Edit: I found a solution that works with 'vanilla' h5py; see below.

In the h5py source there is the following cython code:

cdef TypeStringID _c_string(dtype dt):
    # Strings (fixed-length)
    cdef hid_t tid

    tid = H5Tcopy(H5T_C_S1)
    H5Tset_size(tid, dt.itemsize)
    H5Tset_strpad(tid, H5T_STR_NULLPAD)
    return TypeStringID(tid)

I'm not entirely sure what it does. However, after commenting out the line that says H5Tset_strpad(tid, H5T_STR_NULLPAD) and compiling the library, the problem seems solved, and python2 setup.py test does not report any unexpected failures. It is the only function that references H5T_C_S1 outside the context of variable length strings. This looks somewhat like a bug.

So, one (hacky) way to do it would be executing the following commands in the directory of your script.

$ git clone https://github.com/h5py/h5py h5py-source
$ mkdir fake-root
$ sed -i '/H5Tset_strpad(tid, H5T_STR_NULLPAD)/d' h5py-source/h5py/h5t.pyx
$ (cd h5py-source; python2 setup.py install --root ../fake-root)
$ mv fake-root/usr/lib/python2.7/site-packages/h5py .

Then, when importing h5py, the h5py in your local directory will override the system-wide installed version. You'd probably be better off installing into the user site-packages, using a virtual environment, or opening an issue.

Be warned that applying this fix might break things in unexpected ways (I have never used hdf5 before and have no idea what the impact of this might be). The real solution probably involves loading strpad from dt.

Edit

I did some more research:

The documentation lists only 3 kinds of supported strings: zero padded fixed length strings and two different kinds of variable length strings. There is no mention of zero terminated strings. So it looks like the h5py public API does not support null terminated strings (even though null terminated C strings are mentioned in the code).
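To make the two fixed-length schemes concrete, here is a minimal pure-Python sketch of how a 64-byte slot could be laid out under each one. The helper names null_padded and null_terminated are hypothetical and are not part of h5py or HDF5, and zero-filling the bytes after the terminator is an assumption made for this sketch.

```python
SLOT_SIZE = 64  # bytes per element, matching STRSIZE 64 in the dumps above


def null_padded(text, size=SLOT_SIZE):
    """H5T_STR_NULLPAD-style slot: all `size` bytes may hold characters."""
    raw = text.encode('ascii')
    if len(raw) > size:
        raise ValueError('string does not fit in %d bytes' % size)
    return raw.ljust(size, b'\x00')


def null_terminated(text, size=SLOT_SIZE):
    """H5T_STR_NULLTERM-style slot: one byte is reserved for the terminator."""
    raw = text.encode('ascii')
    if len(raw) > size - 1:
        raise ValueError('string plus terminator does not fit in %d bytes' % size)
    # Zero-filling after the terminator is a choice made for this sketch.
    return (raw + b'\x00').ljust(size, b'\x00')


short = 'please work 0'
print(null_padded(short) == null_terminated(short))
print(len(null_padded('x' * SLOT_SIZE)))
```

In this sketch a short string gives the same bytes under both schemes, which suggests that the difference h5dump reports comes from the STRPAD flag stored with the datatype rather than from the string bytes themselves. The schemes only diverge for a string that fills the whole slot, where the null terminated variant has no room left for its terminator.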

Next, the dtype argument should be a valid numpy dtype. There is no explicit mention of support for H5T. However, somehow the H5T type still gets interpreted as a string. Changing the padding did not change any attribute of the dtype received in TypeStringID.

The conversion of a numpy dtype to an h5t type happens in dataset.py:736:

if isinstance(dtype, Datatype):
    # Named types are used as-is
    tid = dtype.id
    dtype = tid.dtype  # Following code needs this
else:
    # Validate dtype
    if dtype is None and data is None:
        dtype = numpy.dtype("=f4")
    elif dtype is None and data is not None:
        dtype = data.dtype
    else:
        dtype = numpy.dtype(dtype)
    tid = h5t.py_create(dtype, logical=1)

Here, numpy.dtype(H5T_C_S1) gives a dtype with kind='S', and the call to h5t.py_create(dtype, logical=1) then dispatches it to the _c_string(dt) function shown above. Thus the fix would indeed break things, because all fixed length strings would end up being null terminated.
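The round trip described above can be checked directly with the low-level API. This is a minimal sketch, assuming a reasonably recent h5py and NumPy. It uses the .dtype property of the hand-built type as a stand-in for the numpy.dtype(dtype) call in create_dataset, and the PAD_NAMES lookup table is only for readable output.

```python
import numpy as np
import h5py

# Low-level type built by hand, as in the question: a copy of H5T_C_S1
# resized to 64 bytes.
by_hand = h5py.h5t.C_S1.copy()
by_hand.set_size(64)

# Approximation of what create_dataset does with a dtype argument that is
# not an h5py.Datatype: reduce it to a NumPy dtype, then build a fresh
# HDF5 type from that NumPy dtype.
as_numpy = by_hand.dtype
rebuilt = h5py.h5t.py_create(as_numpy, logical=True)

# Lookup table used only to print readable names (not part of h5py).
PAD_NAMES = {
    h5py.h5t.STR_NULLTERM: "NULLTERM",
    h5py.h5t.STR_NULLPAD: "NULLPAD",
    h5py.h5t.STR_SPACEPAD: "SPACEPAD",
}

print(as_numpy.kind, as_numpy.itemsize)
print("by hand:", PAD_NAMES[by_hand.get_strpad()])
print("rebuilt:", PAD_NAMES[rebuilt.get_strpad()])
```

Going by the _c_string source quoted earlier, the hand-built type should report NULLTERM and the rebuilt one NULLPAD, because a NumPy 'S' dtype has no field in which the padding scheme could survive the conversion.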

However, this also points to a much better solution. By constructing an h5py.Datatype from an H5T tid, we can bypass the numpy.dtype conversion.

This code works correctly with a vanilla h5py install:

import h5py

hdf5 = h5py.File('test-p.h5', 'w')

tid = h5py.h5t.C_S1.copy()
tid.set_size(64)
H5T_C_S1_64 = h5py.Datatype(tid)

dataset = hdf5.create_dataset('test dataset', (5,),
                              data=['please work %s' % n for n in range(5)],
                              dtype=H5T_C_S1_64)
hdf5.close()

This also allows you to use any padding scheme you want. However, I could not find documentation for it, so the API might change in the future.
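If h5dump is not at hand, the result can also be checked from Python. The following is a minimal sketch, assuming a recent h5py and NumPy. It writes to an in-memory file through the core driver so that nothing touches the disk, and the file name strpad-check.h5 is arbitrary. The set_strpad call is redundant for a copy of C_S1 and is there only to show where a different scheme would go.

```python
import numpy as np
import h5py

# Build the 64-byte type and wrap it in h5py.Datatype so that
# create_dataset uses it as-is instead of converting it through NumPy.
tid = h5py.h5t.C_S1.copy()
tid.set_size(64)
tid.set_strpad(h5py.h5t.STR_NULLTERM)  # swap in another scheme here
wrapped = h5py.Datatype(tid)

values = np.array(['please work %d' % n for n in range(5)], dtype='S64')

# In-memory file: core driver with no backing store.
with h5py.File('strpad-check.h5', 'w', driver='core',
               backing_store=False) as f:
    dset = f.create_dataset('test dataset', (5,), data=values, dtype=wrapped)
    # Ask the stored dataset for its own type and read one element back.
    stored_pad = dset.id.get_type().get_strpad()
    first = dset[0]

print(stored_pad == h5py.h5t.STR_NULLTERM)
print(first)
```

Reading the padding from dset.id.get_type() rather than from tid checks what the dataset was actually created with, which is the same information h5dump prints as STRPAD.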

Lennart answered Nov 15 '22 05:11