Keras difference between generator and sequence

I'm using a deep CNN+LSTM network to perform classification on a dataset of 1D signals. I'm using Keras 2.2.4 backed by TensorFlow 1.12.0. Since I have a large dataset and limited resources, I'm using a generator to load the data into memory during the training phase. First, I tried this generator:

def data_generator(batch_size, preproc, type, x, y):
    num_examples = len(x)
    examples = zip(x, y)
    examples = sorted(examples, key=lambda x: x[0].shape[0])
    end = num_examples - batch_size + 1
    batches = [examples[i:i + batch_size] for i in range(0, end, batch_size)]

    random.shuffle(batches)
    while True:
        for batch in batches:
            x, y = zip(*batch)
            yield preproc.process(x, y)

Using the above method, I'm able to launch training with a mini-batch size of up to 30 samples at a time. However, this kind of method does not guarantee that the network will only train once on each sample per epoch. Considering this comment from the Keras documentation:

Sequence is a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

I've tried another way of loading data using the following class:

class Data_Gen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.x, self.y = np.array(x_set), np.array(y_set)
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])
        np.random.shuffle(self.indices)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        # print(self.type + ' - len : ' + str(int(np.ceil(self.x.shape[0] / self.batch_size))))
        return int(np.ceil(self.x.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return self.preproc.process(batch_x, batch_y)

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

I can confirm that with this method the network trains once on each sample per epoch, but this time, when I put more than 7 samples in the mini-batch, I get an out-of-memory error:

OP_REQUIRES failed at random_op.cc: 202: Resource exhausted: OOM when allocating tensor with shape...............

I can confirm that I'm using the same model architecture, configuration, and machine for both tests. I'm wondering why there would be a difference between these two ways of loading data.

Please don't hesitate to ask for more details if needed.

Thanks in advance.

EDITED:

Here is the code I'm using to fit the model:

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    factor=0.1,
    patience=2,
    min_lr=params["learning_rate"])

checkpointer = keras.callbacks.ModelCheckpoint(
    filepath=str(get_filename_for_saving(save_dir)),
    save_best_only=False)

batch_size = params.get("batch_size", 32)

path = './logs/run-{0}'.format(datetime.now().strftime("%b %d %Y %H:%M:%S"))
tensorboard = keras.callbacks.TensorBoard(log_dir=path, histogram_freq=0,
                                          write_graph=True, write_images=False)

if index == 0:
    print(model.summary())
    print("Model memory needed for batchsize {0} : {1} Gb".format(batch_size, get_model_memory_usage(batch_size, model)))

if params.get("generator", False):
    train_gen = load.data_generator(batch_size, preproc, 'Train', *train)
    dev_gen = load.data_generator(batch_size, preproc, 'Dev', *dev)
    valid_metrics = Metrics(dev_gen, len(dev[0]) // batch_size, batch_size)
    model.fit_generator(
        train_gen,
        steps_per_epoch=len(train[0]) / batch_size + 1 if len(train[0]) % batch_size != 0 else len(train[0]) // batch_size,
        epochs=MAX_EPOCHS,
        validation_data=dev_gen,
        validation_steps=len(dev[0]) / batch_size + 1 if len(dev[0]) % batch_size != 0 else len(dev[0]) // batch_size,
        callbacks=[valid_metrics, MyCallback(), checkpointer, reduce_lr, tensorboard])

    # train_gen = load.Data_Gen(batch_size, preproc, 'Train', *train)
    # dev_gen = load.Data_Gen(batch_size, preproc, 'Dev', *dev)
    # model.fit_generator(
    #     train_gen,
    #     epochs=MAX_EPOCHS,
    #     validation_data=dev_gen,
    #     callbacks=[valid_metrics, MyCallback(), checkpointer, reduce_lr, tensorboard])
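The steps_per_epoch and validation_steps expressions above are only there to account for a final partial batch; a shorter way to express that intent (just a sketch, not the code I actually ran) would be ceiling division:

import math

# Hypothetical equivalent of the steps computation above: number of batches
# needed to cover all samples, including a final partial batch.
steps_per_epoch = math.ceil(len(train[0]) / batch_size)
validation_steps = math.ceil(len(dev[0]) / batch_size)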
Asked Jun 05 '19 by Maystro




1 Answer

Those two methods are roughly the same. It is correct to subclass Sequence when your dataset doesn't fit in memory, but you shouldn't run any preprocessing in the class' methods, because that will be re-executed once per epoch, wasting a lot of computing resources.
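To illustrate that point, one hypothetical way to avoid repeating the work is to run the preprocessing a single time in __init__ and only slice the result in __getitem__. This is just a sketch: it assumes preproc.process can be applied to the whole dataset at once, that it returns an (x, y) pair, and that the preprocessed arrays fit in memory.

import numpy as np
from keras.utils import Sequence

class PreprocessedGen(Sequence):
    """Hypothetical Sequence that preprocesses the data once, up front."""

    def __init__(self, batch_size, preproc, x_set, y_set):
        # The (possibly expensive) preprocessing runs exactly once here,
        # instead of inside __getitem__ where it would run every epoch.
        x, y = preproc.process(x_set, y_set)
        self.x, self.y = np.array(x), np.array(y)
        self.batch_size = batch_size
        self.indices = np.arange(len(self.x))
        np.random.shuffle(self.indices)

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Only slicing happens per batch; no preprocessing is repeated.
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[inds], self.y[inds]

    def on_epoch_end(self):
        np.random.shuffle(self.indices)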

It is probably also easier to shuffle the samples rather than their indices. Like this:

from random import shuffle

class DataGen(Sequence):
    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.samples = list(zip(x_set, y_set))
        self.batch_size = batch_size
        shuffle(self.samples)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, i):
        batch = self.samples[i * self.batch_size:(i + 1) * self.batch_size]
        return self.preproc.process(*zip(*batch))

    def on_epoch_end(self):
        shuffle(self.samples)
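For completeness, a Sequence like this can be passed straight to fit_generator without steps_per_epoch or validation_steps, since Keras derives the number of batches from __len__. A usage sketch, reusing the model, preproc, train, dev and MAX_EPOCHS names from the question:

# Hypothetical usage; the names come from the question's code.
train_gen = DataGen(batch_size, preproc, 'Train', *train)
dev_gen = DataGen(batch_size, preproc, 'Dev', *dev)

model.fit_generator(
    train_gen,
    epochs=MAX_EPOCHS,
    validation_data=dev_gen)  # steps are inferred from len(train_gen) / len(dev_gen)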

I think it is impossible to say why you run out of memory without knowing more about your data. My guess would be that your preproc function is doing something wrong. You can debug it by running:

for e in DataGen(batch_size, preproc, 'Train', *train):
    print(e)

for e in DataGen(batch_size, preproc, 'Dev', *dev):
    print(e)

You will most likely run out of memory.
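If the loop does blow up, it may help to print only the shape and size of each processed batch instead of the full tensors, to see where the memory goes. A sketch along those lines, assuming preproc.process returns NumPy-compatible (x, y) batches:

import numpy as np

# Hypothetical diagnostic: report shape, dtype and approximate memory per batch.
# The Sequence is indexed explicitly so the loop covers exactly one epoch.
gen = DataGen(batch_size, preproc, 'Train', *train)
for i in range(len(gen)):
    batch_x, batch_y = gen[i]
    batch_x = np.asarray(batch_x)
    print(i, batch_x.shape, batch_x.dtype, '{:.1f} MB'.format(batch_x.nbytes / 1e6))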

Answered Sep 21 '22 by Björn Lindqvist