I use the Python package h5py (version 2.5.0) to access my hdf5 files.
I want to traverse the content of a file and do something with every dataset.
Using the visit method:
import h5py

def print_it(name):
    dset = f[name]
    print(dset)
    print(type(dset))

with h5py.File('test.hdf5', 'r') as f:
    f.visit(print_it)
for a test file I obtain:
<HDF5 group "/x" (1 members)>
<class 'h5py._hl.group.Group'>
<HDF5 dataset "y": shape (100, 100, 100), type "<f8">
<class 'h5py._hl.dataset.Dataset'>
which tells me that there is a dataset and a group in the file. However, there is no obvious way, apart from using type(), to differentiate between the datasets and the groups. The h5py documentation unfortunately does not say anything about this topic. It always assumes that you know beforehand which items are groups and which are datasets, for example because you created the datasets yourself.
I would like to have something like:
f = h5py.File(..)
for key in f.keys():
    x = f[key]
    print(x.is_group(), x.is_dataset())  # does not exist
How can I differentiate between groups and datasets when reading an unknown hdf5 file in Python with h5py? How can I get a list of all datasets, of all groups, of all links?
Within one HDF5 file, you can store data organized in much the same way that you might organize files and folders on your computer. However, in an HDF5 file, what we call "directories" or "folders" on our computers are called groups, and what we call files on our computers are called datasets.
HDF5 is a specification and format for storing hierarchical data from very large data sources. In HDF5 the data is organized in a file. The file object acts as the / (root) group of the hierarchy. Similar to the UNIX file system, in HDF5 the datasets and their groups are organized as an inverted tree.
Open an HDF5/H5 file in HDFView. If you click on the name of the HDF5 file in the left-hand window of HDFView, you can view metadata for the file. This will be located in the bottom window of the application.
An HDF5 dataset is an object composed of a collection of data elements, or raw data, and metadata that stores a description of the data elements, data layout, and all other information necessary to write, read, and interpret the stored data.
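To make the concepts concrete, here is a small sketch that builds a file with the same layout the question's output shows (one group "x" containing one dataset "y"); the file name test.hdf5 simply mirrors the question:

```python
import numpy as np
import h5py

# Create a file containing one group "x" with one dataset "x/y",
# mirroring the layout shown in the question's output.
with h5py.File('test.hdf5', 'w') as f:
    grp = f.create_group('x')                          # a group ("folder")
    grp.create_dataset('y', data=np.zeros((100, 100, 100)))  # a dataset ("file")
```
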
Unfortunately, there is no built-in way in the h5py API to check this, but you can simply check the type of the item with is_dataset = isinstance(item, h5py.Dataset).
To list all the content of the file (except the file's attributes, though) you can use Group.visititems with a callable that takes the name and instance of each item.
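A minimal sketch of that visititems approach (the file and object names here are made up for illustration; it builds its own throwaway file so it runs standalone):

```python
import h5py

# Build a tiny example file so the snippet is self-contained.
with h5py.File('example.h5', 'w') as f:
    f.create_group('grp').create_dataset('dset', data=[1, 2, 3])

def classify(name, obj):
    # visititems passes both the path and the object itself,
    # so no second lookup via f[name] is needed.
    if isinstance(obj, h5py.Dataset):
        print(name, '-> dataset, shape', obj.shape)
    elif isinstance(obj, h5py.Group):
        print(name, '-> group')

with h5py.File('example.h5', 'r') as f:
    f.visititems(classify)
# prints:
# grp -> group
# grp/dset -> dataset, shape (3,)
```
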
While the answers by Gall and James Smith indicate the solution in general, the traversal through the hierarchical HDF5 structure and the filtering of all datasets still needs to be done. I did it using yield from, which is available in Python 3.3+ and works quite nicely, and present it here.
import h5py

def h5py_dataset_iterator(g, prefix=''):
    for key, item in g.items():
        path = '{}/{}'.format(prefix, key)
        if isinstance(item, h5py.Dataset):  # test for dataset
            yield (path, item)
        elif isinstance(item, h5py.Group):  # test for group (go down)
            yield from h5py_dataset_iterator(item, path)

with h5py.File('test.hdf5', 'r') as f:
    for (path, dset) in h5py_dataset_iterator(f):
        print(path, dset)
For example, if you want to print the structure of an HDF5 file, you can use the following code:
import h5py

def h5printR(item, leading=''):
    for key in item:
        if isinstance(item[key], h5py.Dataset):
            print(leading + key + ': ' + str(item[key].shape))
        else:
            print(leading + key)
            h5printR(item[key], leading + '  ')

# Print structure of a `.h5` file
def h5print(filename):
    with h5py.File(filename, 'r') as h:
        print(filename)
        h5printR(h, '  ')
>>> h5print('/path/to/file.h5')
file.h5
  test
    repeats
      cell01: (2, 300)
      cell02: (2, 300)
      cell03: (2, 300)
      cell04: (2, 300)
      cell05: (2, 300)
    response
      firing_rate_10ms: (28, 30011)
    stimulus: (300, 50, 50)
    time: (300,)
Because h5py groups expose a Python dictionary-like interface, you need to use the values() method to actually access the items. So you can use a list comprehension as a filter:
datasets = [item for item in f["Data"].values() if isinstance(item, h5py.Dataset)]
Doing this recursively should be simple enough.
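A recursive version of that filter might look like the sketch below (the helper name collect_datasets and the demo group layout are made up for illustration):

```python
import h5py

def collect_datasets(group):
    """Recursively collect (path, dataset) pairs for every
    h5py.Dataset below `group`, descending into subgroups."""
    found = []
    for item in group.values():
        if isinstance(item, h5py.Dataset):
            found.append((item.name, item))   # item.name is the absolute path
        elif isinstance(item, h5py.Group):
            found.extend(collect_datasets(item))
    return found
```
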
I prefer this solution. It finds the list of all objects in the hdf5 file "h5file", then sorts them based on class, similar to what has been mentioned before but not in such a succinct way:
import h5py

fh5 = h5py.File(h5file, 'r')
all_h5_objs = []
fh5.visit(all_h5_objs.append)
all_groups   = [obj for obj in all_h5_objs if isinstance(fh5[obj], h5py.Group)]
all_datasets = [obj for obj in all_h5_objs if isinstance(fh5[obj], h5py.Dataset)]
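The "list of all links" part of the question is not covered by the isinstance checks above, since indexing a group resolves links to their targets. Group.get with getlink=True returns the link object itself (h5py.HardLink, h5py.SoftLink, or h5py.ExternalLink) instead. A sketch, using a made-up demo file with one soft link:

```python
import h5py

# Build a small file containing a soft link so there is something to find.
with h5py.File('links_demo.h5', 'w') as f:
    f.create_dataset('data', data=[1, 2, 3])
    f['alias'] = h5py.SoftLink('/data')

with h5py.File('links_demo.h5', 'r') as f:
    # getlink=True returns the link object rather than the resolved target.
    soft_links = [name for name in f
                  if isinstance(f.get(name, getlink=True), h5py.SoftLink)]
    print(soft_links)  # ['alias']
```

Note that visit/visititems only walk hard-linked objects, so collecting soft and external links across a whole hierarchy needs an explicit recursive walk with get(..., getlink=True) at each level.
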