 

How to read batches from one HDF5 data file for training?

I have an HDF5 training dataset of shape (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to train the network on mini-batches of size 128.

I want to ask:

How can I feed a mini-batch of 128 training samples from the whole dataset to TensorFlow each time?

asked Jul 06 '16 by karl_TUM



2 Answers

If your dataset is so large that it can't be imported into memory the way keveman suggested, you can use the h5py object directly:

import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128
sess = tf.Session()
train_op = ...  # tf.something_useful()
input = ...     # tf.placeholder or something
for i in range(0, data_size, batch_size):
    # only this slice of the HDF5 dataset is read into memory
    current_data = data['data_set'][i:i + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

You can also run through a huge number of iterations and randomly select a batch each time:

import random
for i in range(iterations):
    # pick a random batch-aligned offset into the dataset
    pos = random.randint(0, data_size // batch_size - 1) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

Or sequentially:

for i in range(iterations):
    # step through the batches in order, wrapping around at the end
    pos = (i % (data_size // batch_size)) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

You probably want to write some more sophisticated code that goes through all the data randomly, but keeps track of which batches have been used, so that no batch is used more often than the others. Once you've done a full run through the training set, you enable all batches again and repeat; see the sketch below.
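A minimal sketch of that idea, reusing the data, data_size, batch_size, sess, train_op and input names from above (num_epochs is a hypothetical parameter, not from the original answer): shuffling the batch offsets once per epoch guarantees every batch is visited exactly once before any batch repeats.

import numpy as np

num_batches = data_size // batch_size
for epoch in range(num_epochs):  # num_epochs: however many passes you want
    # draw a fresh random ordering of the batch offsets each epoch
    offsets = np.random.permutation(num_batches) * batch_size
    for pos in offsets:
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})

Each epoch then sees the full training set exactly once, in a fresh random order.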

answered Oct 27 '22 by alkanen


You can read the HDF5 dataset into a NumPy array and feed slices of that array to the TensorFlow model. Pseudocode like the following would work:

import numpy
import h5py

f = h5py.File('somefile.h5', 'r')
data = f.get('path/to/my/dataset')
# note: this loads the entire dataset into memory at once
data_as_array = numpy.array(data)
for i in range(0, 21760, 128):
  sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})
answered Oct 27 '22 by keveman