I'm using a deep CNN+LSTM network to perform classification on a dataset of 1D signals, with Keras 2.2.4 backed by TensorFlow 1.12.0. Since I have a large dataset and limited resources, I'm using a generator to load the data into memory during the training phase. First, I tried this generator:
import random

def data_generator(batch_size, preproc, type, x, y):
    num_examples = len(x)
    examples = zip(x, y)
    # Sort by signal length so each batch contains similarly sized samples
    examples = sorted(examples, key=lambda x: x[0].shape[0])
    end = num_examples - batch_size + 1
    batches = [examples[i:i + batch_size]
               for i in range(0, end, batch_size)]
    # Shuffle whole batches, not individual samples, to preserve the grouping
    random.shuffle(batches)
    while True:
        for batch in batches:
            x, y = zip(*batch)
            yield preproc.process(x, y)
Using the above method, I'm able to launch training with a mini-batch size of up to 30 samples. However, this kind of method does not guarantee that the network will train only once on each sample per epoch. Considering this comment from the Keras documentation:
Sequence is a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.
I've tried another way of loading data using the following class:
class Data_Gen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.x, self.y = np.array(x_set), np.array(y_set)
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])
        np.random.shuffle(self.indices)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(self.x.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        # Map the batch index to a shuffled slice of sample indices
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return self.preproc.process(batch_x, batch_y)

    def on_epoch_end(self):
        # Reshuffle so batches differ between epochs
        np.random.shuffle(self.indices)
I can confirm that, using this method, the network trains exactly once on each sample per epoch, but this time, as soon as I put more than 7 samples in the mini-batch, I get an out-of-memory error:
OP_REQUIRES failed at random_op.cc:202 : Resource exhausted: OOM when allocating tensor with shape...............
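For anyone debugging a similar OOM on TF 1.x: TensorFlow by default reserves nearly all GPU memory at session creation, which can obscure how much the model itself actually needs. A standard tf.ConfigProto sketch like the one below (not specific to this model) switches to on-demand allocation before the model is built:

import tensorflow as tf
import keras.backend as K

# Allocate GPU memory incrementally instead of reserving it all at session start
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))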
I can confirm that I'm using the same model architecture, configuration, and machine for both tests. I'm wondering why there would be a difference between these two ways of loading data?
Please don't hesitate to ask for more details if needed.
Thanks in advance.
EDITED:
Here is the code I'm using to fit the model:
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    factor=0.1,
    patience=2,
    min_lr=params["learning_rate"])
checkpointer = keras.callbacks.ModelCheckpoint(
    filepath=str(get_filename_for_saving(save_dir)),
    save_best_only=False)
batch_size = params.get("batch_size", 32)
path = './logs/run-{0}'.format(datetime.now().strftime("%b %d %Y %H:%M:%S"))
tensorboard = keras.callbacks.TensorBoard(
    log_dir=path,
    histogram_freq=0,
    write_graph=True,
    write_images=False)
if index == 0:
    print(model.summary())
    print("Model memory needed for batchsize {0} : {1} Gb".format(
        batch_size, get_model_memory_usage(batch_size, model)))

if params.get("generator", False):
    train_gen = load.data_generator(batch_size, preproc, 'Train', *train)
    dev_gen = load.data_generator(batch_size, preproc, 'Dev', *dev)
    valid_metrics = Metrics(dev_gen, len(dev[0]) // batch_size, batch_size)
    model.fit_generator(
        train_gen,
        # ceil(num_samples / batch_size) so the last partial batch is counted
        steps_per_epoch=int(np.ceil(len(train[0]) / batch_size)),
        epochs=MAX_EPOCHS,
        validation_data=dev_gen,
        validation_steps=int(np.ceil(len(dev[0]) / batch_size)),
        callbacks=[valid_metrics, MyCallback(), checkpointer,
                   reduce_lr, tensorboard])
    # train_gen = load.Data_Gen(batch_size, preproc, 'Train', *train)
    # dev_gen = load.Data_Gen(batch_size, preproc, 'Dev', *dev)
    # model.fit_generator(
    #     train_gen,
    #     epochs=MAX_EPOCHS,
    #     validation_data=dev_gen,
    #     callbacks=[valid_metrics, MyCallback(), checkpointer,
    #                reduce_lr, tensorboard])
The idea behind using a Keras generator is to get batches of inputs and corresponding outputs on the fly during the training process, e.g. reading in 100 images, getting the corresponding 100 label vectors, and then feeding this set to the GPU for a training step.
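In its simplest form, that pattern is just an endless loop yielding slices; the toy sketch below uses made-up shapes purely for illustration:

import numpy as np

def batch_generator(x, y, batch_size):
    # Yield (inputs, labels) batches on the fly instead of loading everything at once
    while True:  # Keras expects a training generator to loop indefinitely
        for i in range(0, len(x), batch_size):
            yield x[i:i + batch_size], y[i:i + batch_size]

# Toy stand-in for "100 images and their 100 label vectors"
x = np.random.randn(100, 64, 64, 3).astype('float32')
y = np.random.randint(0, 10, size=(100,))
gen = batch_generator(x, y, batch_size=32)
x_batch, y_batch = next(gen)  # x_batch: (32, 64, 64, 3), y_batch: (32,)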
Those methods are roughly the same. It is correct to subclass Sequence when your dataset doesn't fit in memory, but you shouldn't run any preprocessing in any of the class's methods, because that would be re-executed once per epoch, wasting lots of computing resources.
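To make that concrete, here is a hypothetical variant (PreprocessedGen is not from the original post) that pays the preprocessing cost once in __init__, assuming preproc.process is deterministic and its output fits in memory:

import numpy as np
from keras.utils import Sequence

class PreprocessedGen(Sequence):

    def __init__(self, batch_size, preproc, x_set, y_set):
        # Run the (expensive) preprocessing a single time, up front
        self.x, self.y = preproc.process(x_set, y_set)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, i):
        # Per-batch work is now a cheap slice
        sl = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

If the preprocessed arrays themselves don't fit in memory, this isn't an option, and the per-batch preprocessing should simply be kept as cheap as possible.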
It is probably also easier to shuffle the samples rather than their indices, like this:
import numpy as np
from random import shuffle
from keras.utils import Sequence

class DataGen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        # Pair each sample with its label so they can be shuffled together
        self.samples = list(zip(x_set, y_set))
        self.batch_size = batch_size
        shuffle(self.samples)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, i):
        batch = self.samples[i * self.batch_size:(i + 1) * self.batch_size]
        # Unzip back into an x tuple and a y tuple before preprocessing
        return self.preproc.process(*zip(*batch))

    def on_epoch_end(self):
        shuffle(self.samples)
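For completeness, a Sequence like this can be handed straight to fit_generator, which infers the number of steps per epoch from __len__; the worker settings below are illustrative, and this kind of multiprocessing is exactly what the Sequence guarantee makes safe:

train_gen = DataGen(batch_size, preproc, 'Train', *train)
dev_gen = DataGen(batch_size, preproc, 'Dev', *dev)

model.fit_generator(train_gen,
                    epochs=MAX_EPOCHS,
                    validation_data=dev_gen,
                    workers=4,                 # illustrative worker count
                    use_multiprocessing=True)  # safe because DataGen is a Sequence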
I think it is impossible to say why you run out of memory without knowing more about your data. My guess would be that your preproc function is doing something wrong. You can debug it by running:
for e in DataGen(batch_size, preproc, 'Train', *train):
    print(e)

for e in DataGen(batch_size, preproc, 'Dev', *dev):
    print(e)
You will most likely run out of memory.