I have many HDF5 files, each containing a single dataset. I want to combine them into one dataset where all of the data sits in the same volume (each file is an image, and I want one large timelapse image).
I wrote a Python script to extract the data as NumPy arrays, store them, and then write them to a new .h5 file. However, this approach will not work because the combined data uses more than the 32 GB of RAM that I have.
I also tried using h5copy, the command line tool.
h5copy -i file1.h5 -o combined.h5 -s '/dataset' -d '/new_data/t1'
h5copy -i file2.h5 -o combined.h5 -s '/dataset' -d '/new_data/t2'
This works, but it results in many separate datasets within the new file instead of a single dataset containing all of the data in series.
Although you can't explicitly append rows to an HDF5 dataset, you can use the maxshape keyword when creating your dataset so that you can later 'resize' it to accommodate new data. (See http://docs.h5py.org/en/latest/faq.html#appending-data-to-a-dataset)
Your code will end up looking something like this, assuming the number of columns for your dataset is always the same:
import h5py

output_file = h5py.File('your_output_file.h5', 'w')

# keep track of the total number of rows written so far
total_rows = 0

for n, f in enumerate(file_list):
    # read the data from the current input file
    # (in your case each input file stores it at '/dataset')
    with h5py.File(f, 'r') as input_file:
        your_data = input_file['/dataset'][...]

    total_rows = total_rows + your_data.shape[0]
    total_columns = your_data.shape[1]

    if n == 0:
        # first file: create the dataset with an unlimited maxshape
        # so that it can be resized later
        dset = output_file.create_dataset(
            "Name", (total_rows, total_columns), maxshape=(None, None))
        # fill the first section of the dataset
        dset[:, :] = your_data
        where_to_start_appending = total_rows
    else:
        # resize the dataset to accommodate the new data, then append it
        dset.resize(total_rows, axis=0)
        dset[where_to_start_appending:total_rows, :] = your_data
        where_to_start_appending = total_rows

output_file.close()
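Since each of your files holds a single image and you want a timelapse, you may prefer to stack the frames along a new leading time axis instead of appending rows. Here is a minimal sketch using the same maxshape/resize trick, assuming each input file stores a 2-D image at '/dataset' (as in your h5copy commands) and that file_list holds your input files in time order; the output names are just for the example:

import h5py

with h5py.File('combined.h5', 'w') as out:
    dset = None
    for t, fname in enumerate(file_list):
        with h5py.File(fname, 'r') as input_file:
            image = input_file['/dataset'][...]  # one frame per file
        if dset is None:
            # one slot so far, unlimited along the time axis
            dset = out.create_dataset(
                'timelapse', shape=(1,) + image.shape,
                maxshape=(None,) + image.shape, dtype=image.dtype)
        else:
            # grow the time axis by one frame
            dset.resize(t + 1, axis=0)
        dset[t] = image

The result is a single dataset of shape (number_of_files, height, width), so time point t is simply dset[t], and only one frame is held in memory at a time.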