
Loading hdf5 matlab strings into Python

I'm having trouble reading an HDF5 MATLAB 7.3 file with Python. I'm using h5py 2.0.1.

I can read all the matrices stored in the file, but I cannot read a list of strings. h5py shows the strings as a dataset of shape (1, 894) with type |O4. This dataset contains object references, which I tried to dereference using the h5file[obj_ref] syntax.

This yields something like dataset "FFb": shape (4, 1) type "<u2". I interpreted that as an array of four characters, which seems to be the ASCII representation of the string.

Is there an easy way to get the strings out?

Is there any package providing matlab to python hdf5 support?

Asked Aug 20 '12 at 10:08 by Andreas Mueller



2 Answers

I assume you mean it is a cell array of strings in MATLAB? This output looks normal: the dataset is an array of objects (|O4 is the NumPy object datatype). Each object is an array of 2-byte integers (<u2 is the NumPy little-endian unsigned 2-byte integer datatype). h5py has no way of knowing that the dataset is a cell array of strings; it could just as well be a cell array of arbitrary 16-bit integers.

The easiest way to get the strings out would be a list comprehension that uses unichr to convert the characters, like this:

strlist = [u''.join(unichr(c) for c in h5file[obj_ref]) for obj_ref in dataset]

What this does is iterate over the dataset (for obj_ref in dataset) to create a new list. For each object reference, it dereferences the object (h5file[obj_ref]) to get an array of integers. It converts each integer into a character (unichr(c)) and joins those characters all together into a Unicode string (u''.join()).

Note that this produces a list of unicode strings. If you are absolutely sure that every string contains only ASCII characters, you can replace u'' by '' and unichr by chr.

Caveat: I don't have h5py; this post is based on my experience with MATLAB and NumPy. You may need to adjust the syntax or iteration order to suit your dataset.
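As a rough Python 3 sketch of the same idea (where chr replaces unichr), the helpers below flatten the (N, 1)-shaped character arrays and walk a 2-D array of references such as the (1, 894) dataset from the question. The function names are made up for illustration, and the sketch assumes every character fits in a single 16-bit code unit; it has not been checked against every h5py version.

```python
def decode_char_array(codes):
    """Join the integer character codes of one MATLAB string into a str.

    Accepts a 1-D sequence of codes or a 2-D one such as shape (4, 1).
    Assumption: each code is a complete character (no surrogate pairs).
    """
    flat = []
    for row in codes:
        try:
            flat.extend(int(c) for c in row)   # 2-D case: row is a sequence
        except TypeError:
            flat.append(int(row))              # 1-D case: row is a scalar
    return "".join(chr(c) for c in flat)


def read_cell_strings(h5file, refs):
    """Dereference a 2-D array of object references and decode each target."""
    return [decode_char_array(h5file[ref]) for row in refs for ref in row]


# Hypothetical usage with h5py (the file and dataset names are placeholders):
# import h5py
# with h5py.File("data.mat", "r") as f:
#     names = read_cell_strings(f, f["names"])
```

The helpers only index h5file and iterate over the results, so they can be tried on plain dicts and lists before pointing them at a real file.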

Answered Sep 18 '22 at 14:09 by nneonneo


You can get the original Matlab class name of Group and Dataset objects by

dataset.attrs['MATLAB_class']

If the dataset contains a string, this returns b'char'.
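A small sketch of how that attribute could be used to decide whether a dereferenced object should be decoded as text. The helper name is hypothetical, and it assumes the attribute may come back either as bytes (as above) or as str, depending on the h5py version.

```python
def is_matlab_char(obj):
    """Return True if obj carries MATLAB_class == 'char'.

    obj is anything with a dict-like .attrs, e.g. an h5py Dataset.
    """
    cls = obj.attrs.get("MATLAB_class")
    if isinstance(cls, bytes):
        cls = cls.decode("ascii")   # normalise b'char' to 'char'
    return cls == "char"
```

Checking this before converting integers to characters avoids misreading a cell that actually holds 16-bit numeric data.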

Answered Sep 18 '22 at 14:09 by eks2v