
Combining hdf5 files

Tags: python, hdf5, h5py

I have a number of hdf5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all the datasets separately (i.e. not concatenated into a single dataset).

One way to do this is to create an hdf5 file and then copy the datasets one by one. This will be slow and complicated because it will need to be a buffered copy.

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using python/h5py.

asked Aug 28 '13 15:08 by Bitwise



1 Answer

This is actually one of the use-cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:

External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names:

Here's how to do it in h5py:

    import h5py

    # 'a' opens foo.hdf5 for read/write and creates it if it does not exist
    myfile = h5py.File('foo.hdf5', 'a')
    myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")

Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.
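To apply this to several files at once, something like the following sketch might work. The helper name link_datasets, the part0..part2 file names, and the "/data" dataset path are all made up for illustration; the small source files are generated here only so the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def link_datasets(container_path, sources):
    """Add one external link per (link_name, file_path, dataset_path) entry.

    Hypothetical helper, not part of h5py.
    """
    # 'a' keeps whatever the container already holds; 'w' would truncate it.
    with h5py.File(container_path, "a") as container:
        for link_name, file_path, dataset_path in sources:
            container[link_name] = h5py.ExternalLink(file_path, dataset_path)


# Build three tiny stand-in source files, each with a single dataset.
workdir = tempfile.mkdtemp()
sources = []
for i in range(3):
    path = os.path.join(workdir, f"part{i}.hdf5")
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.full(4, i))
    sources.append((f"part{i}", path, "/data"))

# Create the container file; it stores only links, not the data itself.
container_path = os.path.join(workdir, "combined.hdf5")
link_datasets(container_path, sources)

# Each dataset is now reachable through the container under its own name.
with h5py.File(container_path, "r") as combined:
    names = sorted(combined.keys())
    first_values = [int(combined[name][0]) for name in names]
```

Absolute paths are used for the linked files here to keep the sketch unambiguous. Note that the container only works as long as the source files stay where the links point.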

This would be much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent: h5py would see them all as residing in foo.hdf5.
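A small sketch of what "transparent" means in practice, assuming a stand-in otherfile.hdf5 with a dataset at "/resource" (both created here just for the demonstration, in a temporary directory). Indexing the link returns an ordinary dataset object, and slicing it reads only the requested range, so the full dataset does not have to fit in RAM.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
other_path = os.path.join(workdir, "otherfile.hdf5")
foo_path = os.path.join(workdir, "foo.hdf5")

# Stand-in for an existing source file with a single dataset.
with h5py.File(other_path, "w") as other:
    other.create_dataset("resource", data=np.arange(1000))

# The container holds only a link to that dataset.
with h5py.File(foo_path, "a") as foo:
    foo["linked"] = h5py.ExternalLink(other_path, "/resource")

with h5py.File(foo_path, "r") as foo:
    # getlink=True returns the link object itself rather than following it.
    link = foo.get("linked", getlink=True)
    is_external = isinstance(link, h5py.ExternalLink)

    # Normal indexing follows the link and yields a regular dataset.
    dataset = foo["linked"]
    shape = dataset.shape

    # Slicing reads only elements 100..199 from disk.
    chunk_total = int(dataset[100:200].sum())
```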

answered Sep 18 '22 18:09 by Yossarian