I'm using a deep CNN+LSTM network to perform classification on a dataset of 1D signals, with Keras 2.2.4 backed by TensorFlow 1.12.0. Since I have a large dataset and limited resources, I'm using a generator to load the data into memory during the training phase. First, I tried this generator:
import random

def data_generator(batch_size, preproc, type, x, y):
    num_examples = len(x)
    examples = zip(x, y)
    # Sort by signal length so each batch contains similarly sized samples
    examples = sorted(examples, key=lambda x: x[0].shape[0])
    end = num_examples - batch_size + 1
    batches = [examples[i:i + batch_size]
               for i in range(0, end, batch_size)]
    # Shuffle whole batches, not individual samples, to preserve the grouping
    random.shuffle(batches)
    while True:
        for batch in batches:
            x, y = zip(*batch)
            yield preproc.process(x, y)
Using the above method, I'm able to launch training with a mini-batch size of up to 30 samples. However, this kind of method does not guarantee that the network will train only once on each sample per epoch. Considering this comment from the Keras documentation:
Sequence is a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.
I've tried another way of loading data using the following class:
class Data_Gen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.x, self.y = np.array(x_set), np.array(y_set)
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])
        np.random.shuffle(self.indices)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(self.x.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        # Map the batch index to a shuffled slice of sample indices
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return self.preproc.process(batch_x, batch_y)

    def on_epoch_end(self):
        # Reshuffle so batches differ between epochs
        np.random.shuffle(self.indices)
I can confirm that, using this method, the network trains exactly once on each sample per epoch, but this time, as soon as I put more than 7 samples in the mini-batch, I get an out-of-memory error:
OP_REQUIRES failed at random_op.cc:202 : Resource exhausted: OOM when allocating tensor with shape...............
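For anyone debugging a similar OOM on TF 1.x: TensorFlow by default reserves nearly all GPU memory at session creation, which can obscure how much the model itself actually needs. A standard tf.ConfigProto sketch like the one below (not specific to this model) switches to on-demand allocation before the model is built:

import tensorflow as tf
import keras.backend as K

# Allocate GPU memory incrementally instead of reserving it all at session start
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))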
I can confirm that I'm using the same model architecture, configuration, and machine for both tests. I'm wondering why there would be a difference between these two ways of loading data?
Please don't hesitate to ask for more details if needed.
Thanks in advance.
EDITED:
Here is the code I'm using to fit the model:
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    factor=0.1,
    patience=2,
    min_lr=params["learning_rate"])
checkpointer = keras.callbacks.ModelCheckpoint(
    filepath=str(get_filename_for_saving(save_dir)),
    save_best_only=False)
batch_size = params.get("batch_size", 32)
path = './logs/run-{0}'.format(datetime.now().strftime("%b %d %Y %H:%M:%S"))
tensorboard = keras.callbacks.TensorBoard(
    log_dir=path,
    histogram_freq=0,
    write_graph=True,
    write_images=False)
if index == 0:
    print(model.summary())
    print("Model memory needed for batchsize {0} : {1} Gb".format(
        batch_size, get_model_memory_usage(batch_size, model)))

if params.get("generator", False):
    train_gen = load.data_generator(batch_size, preproc, 'Train', *train)
    dev_gen = load.data_generator(batch_size, preproc, 'Dev', *dev)
    valid_metrics = Metrics(dev_gen, len(dev[0]) // batch_size, batch_size)
    model.fit_generator(
        train_gen,
        # ceil(num_samples / batch_size) so the last partial batch is counted
        steps_per_epoch=int(np.ceil(len(train[0]) / batch_size)),
        epochs=MAX_EPOCHS,
        validation_data=dev_gen,
        validation_steps=int(np.ceil(len(dev[0]) / batch_size)),
        callbacks=[valid_metrics, MyCallback(), checkpointer,
                   reduce_lr, tensorboard])
    # train_gen = load.Data_Gen(batch_size, preproc, 'Train', *train)
    # dev_gen = load.Data_Gen(batch_size, preproc, 'Dev', *dev)
    # model.fit_generator(
    #     train_gen,
    #     epochs=MAX_EPOCHS,
    #     validation_data=dev_gen,
    #     callbacks=[valid_metrics, MyCallback(), checkpointer,
    #                reduce_lr, tensorboard])
The idea behind using a Keras generator is to get batches of inputs and corresponding outputs on the fly during the training process, e.g. reading in 100 images, getting the corresponding 100 label vectors, and then feeding this set to the GPU for a training step.
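In its simplest form, that pattern is just an endless loop yielding slices; the toy sketch below uses made-up shapes purely for illustration:

import numpy as np

def batch_generator(x, y, batch_size):
    # Yield (inputs, labels) batches on the fly instead of loading everything at once
    while True:  # Keras expects a training generator to loop indefinitely
        for i in range(0, len(x), batch_size):
            yield x[i:i + batch_size], y[i:i + batch_size]

# Toy stand-in for "100 images and their 100 label vectors"
x = np.random.randn(100, 64, 64, 3).astype('float32')
y = np.random.randint(0, 10, size=(100,))
gen = batch_generator(x, y, batch_size=32)
x_batch, y_batch = next(gen)  # x_batch: (32, 64, 64, 3), y_batch: (32,)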
Those methods are roughly the same. It is correct to subclass Sequence when your dataset doesn't fit in memory, but you shouldn't run any preprocessing in any of the class's methods, because that would be re-executed once per epoch, wasting lots of computing resources.
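To make that concrete, here is a hypothetical variant (PreprocessedGen is not from the original post) that pays the preprocessing cost once in __init__, assuming preproc.process is deterministic and its output fits in memory:

import numpy as np
from keras.utils import Sequence

class PreprocessedGen(Sequence):

    def __init__(self, batch_size, preproc, x_set, y_set):
        # Run the (expensive) preprocessing a single time, up front
        self.x, self.y = preproc.process(x_set, y_set)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, i):
        # Per-batch work is now a cheap slice
        sl = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

If the preprocessed arrays themselves don't fit in memory, this isn't an option, and the per-batch preprocessing should simply be kept as cheap as possible.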
It is probably also easier to shuffle the samples rather than their indices, like this:
import numpy as np
from random import shuffle
from keras.utils import Sequence

class DataGen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        # Pair each sample with its label so they can be shuffled together
        self.samples = list(zip(x_set, y_set))
        self.batch_size = batch_size
        shuffle(self.samples)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, i):
        batch = self.samples[i * self.batch_size:(i + 1) * self.batch_size]
        # Unzip back into an x tuple and a y tuple before preprocessing
        return self.preproc.process(*zip(*batch))

    def on_epoch_end(self):
        shuffle(self.samples)
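For completeness, a Sequence like this can be handed straight to fit_generator, which infers the number of steps per epoch from __len__; the worker settings below are illustrative, and this kind of multiprocessing is exactly what the Sequence guarantee makes safe:

train_gen = DataGen(batch_size, preproc, 'Train', *train)
dev_gen = DataGen(batch_size, preproc, 'Dev', *dev)

model.fit_generator(train_gen,
                    epochs=MAX_EPOCHS,
                    validation_data=dev_gen,
                    workers=4,                 # illustrative worker count
                    use_multiprocessing=True)  # safe because DataGen is a Sequence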
I think it is impossible to say why you run out of memory without knowing more about your data. My guess would be that your preproc function is doing something wrong. You can debug it by running:
for e in DataGen(batch_size, preproc, 'Train', *train):
    print(e)

for e in DataGen(batch_size, preproc, 'Dev', *dev):
    print(e)
You will most likely run out of memory.