
Training a Keras model from batches of .npy files using a generator?

Currently I am dealing with a big-data issue when training image data with Keras. I have a directory containing batches of .npy files. Each batch contains 512 images, and each batch has a corresponding label file, also in .npy format, so the directory looks like: {image_file_1.npy, label_file_1.npy, ..., image_file_37.npy, label_file_37.npy}. Each image file has dimension (512, 199, 199, 3) and each label file has dimension (512, 1) (either 1 or 0). If I load all the images into one ndarray it takes 35+ GB.

I have read through the Keras docs, but I am still not able to find out how to train with a custom generator. I have read about flow_from_directory and ImageDataGenerator(...).flow(), but they are not ideal in this case, or I do not know how to customize them. Here is what I have done:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator

val_gen = ImageDataGenerator(rescale=1./255)
x_test = np.load("../data/val_file.npy")
y_test = np.load("../data/val_label.npy")
val_gen.fit(x_test)  # fit() is only required for featurewise statistics; with rescale alone it is not strictly needed

model = Sequential()
...
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',  # binary labels (0/1) with a single sigmoid output
              optimizer=SGD(),
              metrics=['acc'])

model.fit_generator(generate_batch_from_directory(),  # should give 1 image file and 1 label file per step
                    steps_per_epoch=37,  # one step per pair of .npy batch files
                    validation_data=val_gen.flow(x_test,
                                                 y_test,
                                                 batch_size=64),
                    validation_steps=32)

So generate_batch_from_directory() should take image_file_i.npy and label_file_i.npy every time and optimize the weights until there are no batches left. Each image array in the .npy files has already been processed with augmentation, rotation and scaling. Each .npy file is a proper 50/50 mix of data from classes 1 and 0.
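For example, I imagine generate_batch_from_directory() could be a plain Python generator, roughly like this minimal sketch (the ../data/ path and the 1/255 rescaling are my assumptions, chosen to match the validation setup above):

import os
import numpy as np

def generate_batch_from_directory(data_dir="../data"):
    """Yield one (images, labels) pair per pair of .npy batch files."""
    n_batches = 37  # number of image/label file pairs
    while True:  # loop forever; Keras ends each epoch via steps_per_epoch
        for i in range(1, n_batches + 1):
            x = np.load(os.path.join(data_dir, "image_file_{}.npy".format(i)))
            y = np.load(os.path.join(data_dir, "label_file_{}.npy".format(i)))
            yield x / 255.0, y  # same rescaling as the validation generator

Since such a generator loops forever, the steps_per_epoch=37 argument in the fit_generator call above is what tells Keras where each epoch ends.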

If I concatenate all the batches and create one big array, such as:

X_train = np.concatenate([image_file_1, ..., image_file_37])
y_train = np.concatenate([label_file_1, ..., label_file_37])

it does not fit in memory. Otherwise I could use .flow() to generate image batches to train the model.

Thanks for any advice.

asked Dec 15 '18 by DataPsycho


1 Answer

Finally I was able to solve this problem. But I had to go through the source code and documentation of keras.utils.Sequence to build my own generator class. That documentation helped a lot in understanding how generators work in Keras. You can read more detail in my Kaggle notebook:

import os
import numpy as np
import keras

all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

# map each image file to its label file; half of the files are images
image_label_map = {
        "image_file_{}.npy".format(i + 1): "label_file_{}.npy".format(i + 1)
        for i in range(len(all_files) // 2)}
partition = [item for item in all_files if "image_file" in item]

class DataGenerator(keras.utils.Sequence):

    def __init__(self, file_list):
        """Constructor can be expanded
           with batch size, dimensions etc.
        """
        self.file_list = file_list
        self.on_epoch_end()

    def __len__(self):
        'One batch (i.e. one .npy file) per iteration'
        return int(len(self.file_list))

    def __getitem__(self, index):
        'Get next batch'
        # Generate indexes of the batch
        indexes = self.indexes[index:(index + 1)]

        # single file (file_list_temp always holds exactly one file name)
        file_list_temp = [self.file_list[k] for k in indexes]

        # Set of X_train and y_train
        X, y = self.__data_generation(file_list_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.file_list))

    def __data_generation(self, file_list_temp):
        'Generates data containing one batch file worth of samples'
        data_loc = "datapsycho/imglake/population/train/image_files/"
        # Generate data
        for ID in file_list_temp:
            x_file_path = os.path.join(data_loc, ID)
            y_file_path = os.path.join(data_loc, image_label_map.get(ID))

            # Store sample
            X = np.load(x_file_path)

            # Store class
            y = np.load(y_file_path)

        return X, y
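
# One possible extension (not part of the minimal version above): shuffle
# the file order between epochs so the batch files are not always visited
# in the same order. on_epoch_end is the hook for that:
#
#     def on_epoch_end(self):
#         'Shuffle file order after each epoch'
#         self.indexes = np.arange(len(self.file_list))
#         np.random.shuffle(self.indexes)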

# ====================
# train set
# ====================

training_generator = DataGenerator(partition)
validation_generator = ValDataGenerator(val_partition)  # built the same way as the training generator, from the validation files

hst = model.fit_generator(generator=training_generator,
                          epochs=200,
                          validation_data=validation_generator,
                          use_multiprocessing=True,
                          max_queue_size=32)
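One note on those arguments: because DataGenerator subclasses keras.utils.Sequence, it is safe to combine use_multiprocessing=True with a workers count above the default of 1, so several .npy files can be loaded in parallel (the exact behaviour depends on your Keras version).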
answered Oct 20 '22 by DataPsycho