
Big HDF5 dataset, how to efficiently shuffle after each epoch

I'm currently working with a big image dataset (~60 GB) to train a CNN (Keras/TensorFlow) for a simple classification task. The images are video frames and thus highly correlated in time, so I already shuffled the data once when generating the huge .hdf5 file. To feed the data into the CNN without loading the whole set into memory at once, I wrote a simple batch generator (see code below).

Now my question: it is usually recommended to shuffle the data after each training epoch, right (for SGD convergence reasons)? But to do so I'd have to load the whole dataset after each epoch and shuffle it, which is exactly what I wanted to avoid by using the batch generator. So: is it really that important to shuffle the dataset after each epoch, and if yes, how could I do that as efficiently as possible?

Here is the current code of my batch generator:

import numpy as np
from keras.utils import to_categorical


def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
    """
    Generator that yields batches of images ('xs') and labels ('ys') from an h5 file.
    """
    filesize = len(hdf5_file['labels'])

    while True:
        # count how many entries we have read so far
        n_entries = 0
        # as long as we haven't read all entries from the file: keep reading
        while n_entries < (filesize - batch_size):
            # create a numpy array of input data (features) for this batch
            xs = hdf5_file['images'][n_entries: n_entries + batch_size]
            xs = np.reshape(xs, dimensions).astype('float32')

            # and the label info; contains more than one label in my case,
            # e.g. is_dog, is_cat, fur_color, ...
            y_values = hdf5_file['labels'][n_entries: n_entries + batch_size]
            ys = to_categorical(y_values, num_classes)

            # we have read one more batch from this file
            n_entries += batch_size
            yield (xs, ys)
asked Oct 05 '17 by nkaenzig




1 Answer

Yeah, shuffling helps convergence: if the network sees the data in the same order every epoch, the gradient updates repeat the same pattern and training can get stuck in suboptimal areas.

Don't shuffle the entire data set. Create a list of indices into the data and shuffle that instead; then move sequentially over the index list and use its values to pick samples from the data set (see the sketch below).
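As a concrete illustration, here is a minimal sketch of that idea applied to the generator from the question (the dataset names 'images' and 'labels' and the to_categorical call are taken from there). One caveat: h5py requires the indices of a fancy-indexed read to be in increasing order, and reading scattered single rows is much slower than reading one contiguous slice. The sketch therefore shuffles at batch granularity: the start offsets of the batches are reshuffled each epoch, while every batch itself stays a contiguous slice. Combined with the one-time global shuffle done when the file was written, this gives a different batch order every epoch without loading the full dataset.

import numpy as np
from keras.utils import to_categorical


def generate_shuffled_batches(hdf5_file, batch_size, dimensions, num_classes):
    """Yield batches forever; reshuffle the batch order at the start of each epoch."""
    filesize = len(hdf5_file['labels'])
    # start index of every full batch (the last partial batch is dropped,
    # as in the question's generator)
    batch_starts = np.arange(0, filesize - batch_size + 1, batch_size)

    while True:
        # new epoch: visit the batches in a fresh random order
        np.random.shuffle(batch_starts)
        for start in batch_starts:
            # a contiguous slice keeps the HDF5 read fast
            xs = hdf5_file['images'][start: start + batch_size]
            xs = np.reshape(xs, dimensions).astype('float32')
            y_values = hdf5_file['labels'][start: start + batch_size]
            ys = to_categorical(y_values, num_classes)
            yield (xs, ys)

This is a drop-in replacement for the generator in the question, e.g. passed to model.fit_generator() with steps_per_epoch=filesize // batch_size. If you want per-sample shuffling on top of this, shuffle a full index array, cut it into batches, and sort the indices within each batch before reading (h5py accepts sorted index lists); the order within a batch doesn't matter for SGD, only the batch composition does.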

answered Oct 19 '22 by alkanen