I'm running into trouble reading an HDF5 MATLAB 7.3 file with Python. I'm using h5py 2.0.1.
I can read all the matrices that are stored in the file, but I cannot read a list of strings.
h5py shows the strings as a dataset of shape (1, 894) with type |O4.
This dataset contains object references, which I tried to dereference using the h5file[obj_ref] syntax.
This yields something like dataset "FFb": shape (4, 1) type "<u2".
I interpreted that as an array of chars of length four, which seems to be the ASCII representation of the string.
Is there an easy way to get the strings out?
Is there any package providing MATLAB-to-Python HDF5 support?
To use HDF5 from Python, NumPy needs to be imported. One important feature of HDF5 is that it can attach metadata to every dataset in the file, which provides powerful searching and accessing. Let's get started with installing HDF5 on the computer. As the Python HDF5 tooling works on NumPy, we would need NumPy installed on our machine too.
All strings in HDF5 hold encoded text. You can't store arbitrary binary data in HDF5 strings.
MATLAB supports reading and writing HDF5 data sets using dynamically loaded filters.
I assume you mean it is a cell array of strings in MATLAB? This output looks normal: the dataset is an array of objects (|O4 is the NumPy object datatype). Each object is an array of 2-byte integers (<u2 is the NumPy little-endian unsigned 2-byte integer datatype). h5py has no way of knowing that the dataset is a cell array of strings; it could just as well be a cell array of arbitrary 16-bit integers.
The easiest way to get the strings out would be to use a list comprehension, using unichr to convert the characters, like this:
strlist = [u''.join(unichr(c) for c in h5file[obj_ref]) for obj_ref in dataset]
What this does is iterate over the dataset (for obj_ref in dataset) to create a new list. For each object reference, it dereferences the object (h5file[obj_ref]) to get an array of integers. It converts each integer into a character (unichr(c)) and joins those characters all together into a Unicode string (u''.join()).
Note that this produces a list of unicode strings. If you are absolutely sure that every string contains only ASCII characters, you can replace u'' by '' and unichr by chr.
Caveat: I don't have h5py; this post is based on my experiences with MATLAB and NumPy. You may need to adjust the syntax or iteration order to suit your dataset.
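For the shapes mentioned in the question (a (1, 894) dataset of references, each pointing to a (4, 1) array of <u2 values), the one-liner probably needs one extra level of iteration on each side. Below is a minimal Python 3 sketch of that adjustment, where chr replaces unichr. The helper names decode_matlab_string and read_cell_strings are made up for illustration, as are the file name "data.mat" and the dataset name "names" in the usage comment. It assumes every character fits in a single 16-bit code unit.

```python
def decode_matlab_string(codes):
    """Join 16-bit character codes into a str.

    `codes` may be flat (e.g. [70, 70, 98]) or nested one level deep,
    like the (4, 1) arrays that a dereferenced MATLAB string yields.
    Assumption: no surrogate pairs, so each code is one character.
    """
    flat = []
    for item in codes:
        try:
            # Nested case: each item is itself a short row of codes.
            flat.extend(int(c) for c in item)
        except TypeError:
            # Flat case: the item is already a single integer.
            flat.append(int(item))
    return "".join(chr(c) for c in flat)


def read_cell_strings(h5file, dataset):
    """Dereference every object reference in a 2-D dataset of references."""
    strings = []
    for row in dataset:          # a (1, N) dataset has a single row
        for obj_ref in row:      # each element is an object reference
            strings.append(decode_matlab_string(h5file[obj_ref]))
    return strings


# Hypothetical usage with h5py (file and dataset names are placeholders):
#
#     import h5py
#     with h5py.File("data.mat", "r") as f:
#         names = read_cell_strings(f, f["names"])
```

The helpers only rely on indexing and iteration, so they can be tried out on plain lists and dicts before pointing them at a real file.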
You can get the original MATLAB class name of Group and Dataset objects with dataset.attrs['MATLAB_class']. If dataset contains a string, it will return b'char'.
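As a rough sketch of how that attribute could be used to decide whether a dereferenced object should be decoded as text: the helper name matlab_class below is made up, and it takes the attrs mapping of a Group or Dataset rather than the object itself. It assumes the attribute comes back either as bytes (as in b'char') or as str, depending on the h5py version.

```python
def matlab_class(attrs):
    """Return the MATLAB class name stored in an attrs mapping, or None.

    `attrs` is expected to behave like a dict, as dataset.attrs does.
    """
    value = attrs.get("MATLAB_class")
    if value is None:
        return None
    if isinstance(value, bytes):
        # Byte strings such as b'char' are decoded to a normal str.
        return value.decode("ascii")
    return str(value)


# Hypothetical usage: only decode objects that MATLAB marked as text.
#
#     target = h5file[obj_ref]
#     if matlab_class(target.attrs) == "char":
#         ...  # convert the integer codes to a string
```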