I have a number of HDF5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all the datasets separately (i.e. not concatenate the datasets into a single dataset).
One way to do this is to create an HDF5 file and then copy the datasets one by one. This will be slow and complicated because it will need to be a buffered copy.
Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.
I am using python/h5py.
This is probably due to your chunk layout: the smaller the chunks, the more bloated your HDF5 file will be. Try to find a balance between a chunk size that suits your use case and the size overhead that the chunks introduce in the HDF5 file.
Groups are the container mechanism by which HDF5 files are organized. From a Python perspective, they operate somewhat like dictionaries. In this case the "keys" are the names of group members, and the "values" are the members themselves (Group and Dataset objects).
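As a rough illustration of the dictionary-like behaviour described above, here is a minimal sketch. The file name, group name and dataset names are made up for the example, and the file is kept in memory (core driver with backing_store=False) so nothing is written to disk.

```python
import h5py
import numpy as np

# Hypothetical names; the in-memory "core" driver avoids touching the disk.
with h5py.File("groups_demo.hdf5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("measurements")

    # Two ways to add a member: an explicit call, or dict-style assignment.
    grp.create_dataset("run1", data=np.arange(3))
    grp["run2"] = np.ones(4)

    # Dict-style inspection: keys are member names, values are the objects.
    names = sorted(grp.keys())
    has_run1 = "run1" in grp
    kinds = {name: type(obj).__name__ for name, obj in grp.items()}
```

The file object itself behaves the same way, since it acts as the root group.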
Parallel HDF5 (PHDF5) is the parallel version of the HDF5 library. It utilizes MPI to perform parallel HDF5 operations. For example, when a file is opened with an MPI communicator, all the processes within the communicator can perform various operations on the file.
This is actually one of the use-cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:
External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names:
Here's how to do it in h5py:
import h5py

myfile = h5py.File('foo.hdf5', 'a')
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")
Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.
This would be much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent; that is, h5py would see all the datasets as residing in foo.hdf5.
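For the situation in the question (many files, one dataset each), the same idea can be applied in a loop. This is only a sketch under some assumptions: build_container is a hypothetical helper name, each source file is assumed to hold exactly one dataset at its root, and each link is named after the source file's base name. The source paths are stored in the links as given, so the source files have to stay where they are for the links to resolve.

```python
import os

import h5py


def build_container(container_path, source_paths):
    """Add one external link per source file to a container file.

    Assumes each source file has a single dataset at its root
    (an assumption taken from the question, not checked here).
    """
    # 'a' creates the container if missing and preserves it otherwise.
    with h5py.File(container_path, "a") as container:
        for src in source_paths:
            # Open read-only just long enough to find the dataset's name.
            with h5py.File(src, "r") as f:
                dataset_name = next(iter(f.keys()))

            # Hypothetical naming scheme: link name = file name without extension.
            link_name = os.path.splitext(os.path.basename(src))[0]
            container[link_name] = h5py.ExternalLink(src, "/" + dataset_name)
```

Afterwards, something like h5py.File(container_path, 'r')['part0'] would return the dataset stored in part0.hdf5 without any data having been copied.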