 

How to make sure the training phase won't be facing an OOM?

The question in the title is complete. "How to make sure the training phase won't be facing an OOM?"

Just some side notes: based on my experience, there are two cases of OOM. One is when the memory needed for your model and the mini-batch is bigger than the memory you have. In such cases, the training phase never starts, and the fix is to use a smaller batch size. It would have been great if I could calculate the biggest batch size my hardware can manage for a particular model, but even if I can't find it on the first try, I can always find it with some trial and error, since the process fails right away (as sketched below).
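A minimal sketch of that trial-and-error search, assuming a create_model() factory and a training_dataset like the ones in the code further down (the helper name and the candidate batch sizes are illustrative, not from the original post):

import tensorflow as tf

def find_largest_batch_size(create_model, training_dataset,
                            candidates=(256, 128, 64, 30)):
    """Try candidate batch sizes from largest to smallest and return the
    first one that survives a single training step without an OOM."""
    for batch_size in candidates:
        try:
            model = create_model()
            model.compile(optimizer='adam', loss='categorical_crossentropy')
            # One step is enough to trigger the "fails right away" kind of OOM.
            model.fit(training_dataset.batch(batch_size).take(1),
                      epochs=1, verbose=0)
            return batch_size
        except tf.errors.ResourceExhaustedError:
            # Free whatever the failed attempt allocated before the next try.
            tf.keras.backend.clear_session()
    raise RuntimeError("Even the smallest candidate batch size ran out of memory.")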

The second scenario in which I face OOM is when the training process starts and runs for some time, maybe even a few epochs, but then for some unknown reason hits an OOM. For me, this is the frustrating one, because it can happen at any time and you never know whether the ongoing training will ever finish. So far I have lost days of training time while thinking everything was going forward just fine.

I think some clarifications are in order. First, I'm talking about a personal computer with a GPU, and second, the GPU is dedicated to computation and is not used for display. Correct me if I'm wrong, but I believe this means the training process demands different amounts of memory at different points in time. How can that be? And once again, how can I make sure my training phase won't run into an OOM?

Take this run for example:

3150/4073 [======================>.......] - ETA: 53:39 - loss: 0.3323
2019-10-13 21:41:13.096320: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 60.81MiB (rounded to 63766528).  Current allocation summary follows.

After three hours of training, TensorFlow asked for more memory than my hardware could provide. My question is: why does memory allocation increase at this point and not at the beginning of the process?
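For reference, here is a rough sketch (not from the original post, and using the TF 2.x experimental config API as an assumption) of how the GPU memory policy can be configured up front, either letting allocations grow on demand or pinning TensorFlow to a fixed budget so that an allocation that cannot fit fails immediately instead of hours into training:

import tensorflow as tf

# Must run before the GPU is first used in the process.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Option A: allocate GPU memory on demand instead of grabbing it all at once.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option B (use instead of A, not together with it): give TensorFlow a
    # fixed budget, e.g. 4 GiB, so oversized allocations fail up front.
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])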

[UPDATE]

In light of known issues with eager mode, I'll add some clarification about my case. I'm not coding in eager mode, and here's what my training code looks like:

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
training_dataset = tf.data.Dataset.from_tensor_slices(...)
validation_dataset = tf.data.Dataset.from_tensor_slices(...)

with strategy.scope():
    model = create_model()

    model.compile(optimizer='adam', loss='categorical_crossentropy')

    pocket = EarlyStopping(monitor='val_loss', min_delta=0.001,
                           patience=5, verbose=1,
                           restore_best_weights=True)

    history = model.fit(training_dataset.shuffle(buffer_size=1000).batch(30),
                        epochs=3,
                        callbacks=[pocket],
                        validation_data=validation_dataset.shuffle(buffer_size=1000).batch(30),
                        workers=3, use_multiprocessing=True)
Mehran asked Oct 13 '19


1 Answer

There's a known memory leak [1] that happens if you train repeatedly in a loop. The solution is to call tf.keras.backend.clear_session(), and possibly gc.collect(), every now and then in the loop.

The behavior seems to be different in TF 1.15 and 2.0, though, and this might not be enough to fix it. I find that calling tf.keras.backend.clear_session() in my training loop on CPU resets a gradual memory leak without hurting the training.
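A minimal sketch of that workaround, assuming the model is rebuilt and trained repeatedly in an outer loop and reusing the create_model() and training_dataset names from the question (the loop itself is illustrative, not part of the original answer):

import gc
import tensorflow as tf

num_runs = 5  # e.g. cross-validation folds or hyperparameter trials (illustrative)

for run in range(num_runs):
    model = create_model()
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(training_dataset.shuffle(buffer_size=1000).batch(30), epochs=3)

    # Drop the graph/session state Keras accumulated during this iteration
    # and ask Python to release anything that is no longer referenced.
    tf.keras.backend.clear_session()
    gc.collect()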

[1] https://github.com/tensorflow/tensorflow/issues/30324

user1318499 answered Oct 20 '22