How to append data to one specific dataset in a hdf5 file with h5py

I am looking for a possibility to append data to an existing dataset inside a .h5 file using Python (h5py).

A short intro to my project: I am trying to train a CNN on medical image data. Because of the huge amount of data and the heavy memory usage while transforming the data into NumPy arrays, I need to split the transformation into a few chunks: load and preprocess the first 100 medical images and save the NumPy arrays to an HDF5 file, then load the next 100 images and append them to the existing .h5 file, and so on.

Now, I tried to store the first 100 transformed NumPy arrays as follows:

import h5py
from LoadIPV import LoadIPV

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

# Raw string so the backslash in the Windows path is taken literally.
with h5py.File(r'.\PreprocessedData.h5', 'w') as hf:
    # maxshape=None along axis 0 keeps the first axis extendable.
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))

As can be seen, the transformed NumPy arrays are split into four groups, which are stored in the four HDF5 datasets [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.

My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets. For example, I would like to append the next 100 NumPy arrays to the existing X_train dataset of shape [100, 512, 512, 9], so that X_train then has shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.

asked Nov 02 '17 10:11

Midas.Inc

1 Answer

I have found a solution that seems to work!

Have a look at the related question "Incremental writes to hdf5 with h5py".

To append data to a specific dataset, you first resize that dataset along the corresponding axis and then write the new data into the newly added space at the end of the "old" array.

Thus, the solution looks like this:

# Open in append mode ('a') so the existing datasets are kept.
with h5py.File(r'.\PreprocessedData.h5', 'a') as hf:
    # Grow each dataset along axis 0, then fill the new rows at the end.
    hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis=0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis=0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis=0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis=0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data
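The four resize-and-write pairs follow the same pattern, so they could be folded into a small helper. The following is only a sketch: append_rows is a hypothetical name, and the tiny arrays and the temporary file stand in for the real 512 x 512 images and PreprocessedData.h5.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(h5_file, name, new_data):
    """Grow dataset `name` along axis 0 and write `new_data` at the end."""
    dset = h5_file[name]
    old_len = dset.shape[0]
    dset.resize(old_len + new_data.shape[0], axis=0)
    dset[old_len:] = new_data


# Small stand-in arrays (hypothetical shapes, far smaller than 512 x 512 x 9).
first_batch = {"X_train": np.zeros((3, 4, 4, 2)), "Y_train": np.zeros((3, 4, 4, 1))}
next_batch = {"X_train": np.ones((2, 4, 4, 2)), "Y_train": np.ones((2, 4, 4, 1))}

# Temporary file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# First batch: create the datasets with an unlimited first axis.
with h5py.File(path, "w") as hf:
    for name, arr in first_batch.items():
        hf.create_dataset(name, data=arr, maxshape=(None,) + arr.shape[1:])

# Later batches: reopen in append mode and extend each dataset.
with h5py.File(path, "a") as hf:
    for name, arr in next_batch.items():
        append_rows(hf, name, arr)
    x_shape = hf["X_train"].shape
    y_shape = hf["Y_train"].shape
    last_rows_are_new = bool((hf["X_train"][3:] == 1).all())

print(x_shape, y_shape)
```

Writing to dset[old_len:] rather than to a negative slice makes the target rows explicit; both address the same rows once the dataset has been resized.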

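However, note that the dataset must be created with maxshape set to None along the axis you want to extend. maxshape needs one entry per dimension, so (None,) only fits 1-D data; for the arrays in the question it would be (None, 512, 512, 9). For example:

h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,) + orig_data.shape[1:])

Otherwise the dataset cannot be extended.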
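Before appending to a file created earlier, it may help to check whether a dataset was actually made extendable. The sketch below, which uses made-up dataset names ("fixed", "growable") and a temporary file, compares the maxshape and chunks attributes of a dataset created without maxshape against one created with an unlimited first axis.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# Temporary file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "maxshape_demo.h5")

with h5py.File(path, "w") as hf:
    # No maxshape: the dataset is fixed at its initial shape and not chunked.
    fixed = hf.create_dataset("fixed", data=data)
    # None on axis 0: that axis is unlimited, and h5py enables chunking.
    growable = hf.create_dataset("growable", data=data, maxshape=(None, 4))

    fixed_maxshape = fixed.maxshape
    fixed_chunks = fixed.chunks
    growable_maxshape = growable.maxshape
    growable_is_chunked = growable.chunks is not None

    # Only the extendable dataset is resized here.
    growable.resize(5, axis=0)
    grown_shape = growable.shape

print(fixed_maxshape, fixed_chunks)
print(growable_maxshape, growable_is_chunked, grown_shape)
```

A None entry in maxshape marks an axis that can grow; if the first entry is a fixed number equal to the current length, the data has to be copied into a newly created, extendable dataset before anything can be appended.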

answered Sep 18 '22 06:09

Midas.Inc

Tags: python, hdf5, numpy, deep-learning, h5py
