
Combine HDF5 files into a single dataset

I have many HDF5 files, each containing a single dataset. I want to combine them into one dataset so that all of the data sits in the same volume (each file is an image, and I want one large time-lapse image).

I wrote a Python script that extracts the data from each file as a NumPy array, stores the arrays, and then tries to write them to a new h5 file. This approach does not work because the combined data needs more than the 32 GB of RAM that I have.

I also tried h5copy, the command-line tool:

h5copy -i file1.h5 -o combined.h5 -s '/dataset' -d '/new_data/t1'
h5copy -i file2.h5 -o combined.h5 -s '/dataset' -d '/new_data/t2'

This works, but it produces many separate datasets in the new file instead of one dataset that holds all of the data in series.

not_a_computer_person asked Oct 31 '22 18:10


1 Answer

Although you can't explicitly append rows to an HDF5 dataset, you can pass the maxshape keyword when creating the dataset so that it can later be resized to accommodate new data. (See http://docs.h5py.org/en/latest/faq.html#appending-data-to-a-dataset)

Your code will end up looking something like this, assuming the number of columns for your dataset is always the same:

import h5py

# file_list is assumed to be a list of paths to your input files, and
# '/dataset' the dataset name inside each one (as in the h5copy commands)
output_file = h5py.File('your_output_file.h5', 'w')

# keep track of the total number of rows
total_rows = 0

for n, f in enumerate(file_list):
  # load one file at a time so only a single array is held in memory
  with h5py.File(f, 'r') as input_file:
    your_data = input_file['/dataset'][...]
  total_rows = total_rows + your_data.shape[0]
  total_columns = your_data.shape[1]

  if n == 0:
    # first file; create the dataset with an unlimited maximum shape
    # (dtype is passed explicitly, otherwise h5py defaults to float32)
    create_dataset = output_file.create_dataset("Name", (total_rows, total_columns), dtype=your_data.dtype, maxshape=(None, None))
    # fill the first section of the dataset
    create_dataset[:,:] = your_data
    where_to_start_appending = total_rows

  else:
    # resize the dataset to accommodate the new data
    create_dataset.resize(total_rows, axis=0)
    create_dataset[where_to_start_appending:total_rows, :] = your_data
    where_to_start_appending = total_rows

output_file.close()
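
The code above appends rows to a 2-D dataset. If the goal is instead a time-lapse volume with one image per time point, a variation is to stack each image along a new leading axis. The sketch below is only an illustration, not part of the original answer: it assumes every file holds an image of the same shape and dtype under the name 'dataset' (as in the h5copy commands in the question), and the function name stack_into_timelapse and the output dataset name 'timelapse' are hypothetical. Because the number of files is known in advance, the output dataset can be created at its final size, so no resizing is needed and only one image is in memory at a time.

```python
import h5py


def stack_into_timelapse(input_paths, output_path,
                         src_name="dataset", dst_name="timelapse"):
    """Stack one same-shaped image per file along a new leading time axis."""
    with h5py.File(output_path, "w") as out:
        stack = None
        for t, path in enumerate(input_paths):
            # Read a single image into memory; the file is closed right after.
            with h5py.File(path, "r") as src:
                frame = src[src_name][...]
            if stack is None:
                # The final shape is (number of files, *image shape), so the
                # dataset is allocated once using the first image's shape/dtype.
                stack = out.create_dataset(
                    dst_name,
                    shape=(len(input_paths),) + frame.shape,
                    dtype=frame.dtype,
                )
            # Write this image into its time slot on disk.
            stack[t] = frame
    return output_path
```

Calling stack_into_timelapse(file_list, 'combined.h5') on N images of shape (rows, columns) would give a single dataset of shape (N, rows, columns). If the number of files is not known up front, the maxshape and resize approach shown in the answer still applies, with the time axis as the resizable one.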
Heather QC answered Nov 13 '22 06:11