
Reading nested .h5 groups into a NumPy array

I received this .h5 file from a friend and I need to use the data in it for some work. All the data is numerical. This is the first time I have worked with this kind of file. I found many questions and answers here about reading these files, but I couldn't find a way to get down to the lower levels of the groups (folders) the file contains.

The file contains two main folders, X and Y. X contains a folder named 0, which contains two folders named A and B. Y contains ten folders named 1-10. The data I want to read is in A, B, 1, 2, ..., 10. For instance, I start with

import h5py
f = h5py.File(filename, 'r')
f.keys()

Now f.keys() returns [u'X', u'Y'], the two main folders.

Then I try to read X and Y using read_direct but I get the error

AttributeError: 'Group' object has no attribute 'read_direct'

I try to create an object for X and Y as follows

obj1 = f['X']

obj2 = f['Y']

Then if I use commands like

obj1.shape
obj1.dtype 

I get an error

AttributeError: 'Group' object has no attribute 'shape'

I can see that these commands don't work because I use them on X and Y, which are folders that contain no data, only other folders.

So my question is: how do I get down to the folders named A, B and 1-10 to read the data?

I couldn't find a way to do that even in the documentation http://docs.h5py.org/en/latest/quick.html

Mazin asked Jul 26 '18 22:07



1 Answer

You need to traverse down your HDF5 hierarchy until you reach a dataset. Groups do not have a shape or dtype; only datasets do.
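If the layout is already known, as it is in the question, a dataset can be reached by indexing one level at a time or with the full path in one step. The following is a minimal sketch, assuming A and B are datasets under X/0; the file sample.h5 and its contents are made up here only so that the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file with the layout described in the question.
# The contents are hypothetical; point `filename` at the real file instead.
filename = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(filename, "w") as f:
    # create_dataset creates the intermediate groups X and X/0 as needed
    f.create_dataset("X/0/A", data=np.arange(6).reshape(2, 3))
    f.create_dataset("X/0/B", data=np.ones(4))
    f.create_dataset("Y/1", data=np.zeros((2, 2)))

with h5py.File(filename, "r") as f:
    group = f["X"]["0"]      # step down one group at a time ...
    dset_a = group["A"]      # ... until a Dataset is reached
    dset_b = f["X/0/B"]      # or index with the full path in one go

    a_shape, a_dtype = dset_a.shape, dset_a.dtype  # valid on datasets only
    a = dset_a[:]            # slicing copies the data into a NumPy array
    b = dset_b[:]

print(a_shape, a_dtype)
print(a)
print(b)
```

The arrays a and b are ordinary NumPy arrays, so they stay usable after the file is closed.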

Assuming you do not know your hierarchy structure in advance, you can use a recursive algorithm to yield, via an iterator, full paths to all available datasets in the form group1/group2/.../dataset. Below is an example.

import h5py

def traverse_datasets(hdf_file):

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = f'{prefix}/{key}'
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

You can, for example, iterate over all dataset paths and print the attributes that interest you:

with h5py.File(filename, 'r') as f:
    for dset in traverse_datasets(f):
        print('Path:', dset)
        print('Shape:', f[dset].shape)
        print('Data type:', f[dset].dtype)
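As an alternative to writing the recursion by hand, h5py groups provide a visititems method that walks the hierarchy and calls a function with each object's name and the object itself. A rough sketch of the same listing built on it follows; list_datasets and the sample file are hypothetical names introduced for illustration, not part of h5py or of the code above.

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(h5_group):
    """Return the paths of all datasets below h5_group (hypothetical helper)."""
    paths = []

    def visitor(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            paths.append(name)
        # Implicitly returning None tells h5py to keep walking.

    h5_group.visititems(visitor)
    return paths


# Made-up sample file so the sketch runs on its own.
filename = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(filename, "w") as f:
    f.create_dataset("X/0/A", data=np.arange(3))
    f.create_dataset("Y/1", data=np.zeros(2))

with h5py.File(filename, "r") as f:
    dataset_paths = list_datasets(f)
    for path in dataset_paths:
        print("Path:", path, "Shape:", f[path].shape, "Data type:", f[path].dtype)
```

Note that the names passed by visititems are relative to the group it is called on and carry no leading slash, unlike the paths built by the recursive generator.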

Remember that, by default, datasets in HDF5 are not read entirely into memory; h5py reads from disk only the part you index. You can read a whole dataset into a NumPy array via arr = f[dset][:], where dset is the full path.
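The different ways of pulling data into memory can be sketched as follows. This example assumes a made-up file containing a one-dimensional dataset Y/1 and a scalar dataset Y/2. It also shows read_direct, the method from the question, which works once it is called on a dataset rather than on a group.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up sample file so the sketch runs on its own.
filename = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(filename, "w") as f:
    f.create_dataset("Y/1", data=np.arange(10.0))
    f.create_dataset("Y/2", data=np.float64(3.5))  # scalar dataset, shape ()

with h5py.File(filename, "r") as f:
    dset = f["Y/1"]

    whole = dset[()]     # read everything; this form also works for scalar datasets
    part = dset[2:5]     # read only the selected elements
    scalar = f["Y/2"][()]

    # read_direct fills an existing array instead of allocating a new one.
    out = np.empty(dset.shape, dtype=dset.dtype)
    dset.read_direct(out)

print(whole)
print(part)
print(scalar)
```

Reading a slice such as dset[2:5] is the usual choice when a dataset is too large to hold in memory at once.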

jpp answered Oct 13 '22 03:10
