 

tensorflow-GPU OOM issue after several epochs

Tags:

tensorflow

I used TensorFlow to train a CNN on an NVIDIA GeForce GTX 1060 (6 GB memory), but I got an OOM exception.

The training process went fine for the first two epochs, but raised the OOM exception on the third epoch.

============================
2017-10-27 11:47:30.219130: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************************************************************************************xxxxxx
2017-10-27 11:47:30.265389: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[10,10,48,48,48]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,10,48,48,48]
   [[Node: gradients_4/global/detector_scope/maxpool_conv3d_2/MaxPool3D_grad/MaxPool3DGrad = MaxPool3DGrad[T=DT_FLOAT, TInput=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="VALID", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](global/detector_scope/maxpool_conv3d_2/transpose, global/detector_scope/maxpool_conv3d_2/MaxPool3D, gradients_4/global/detector_scope/maxpool_conv3d_2/transpose_1_grad/transpose)]]
   [[Node: Momentum_4/update/_540 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1540_Momentum_4/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
=============================

So I am confused about why I got this OOM exception on the third epoch after the first two finished successfully.

Given that the dataset is the same for every epoch, if I were running out of GPU memory I should have gotten the exception on the first epoch. But I did successfully finish two epochs. So why did this happen later?

Any suggestions, please?

asked Oct 27 '17 by user426546


1 Answer

There are two points at which you are likely to see OOM errors: when you first start training, and after at least one epoch has completed.

The first situation is simply due to the model's memory footprint. The easiest fix is to reduce the batch size. If your model is really big and the batch size is already down to one, you still have a few options: reduce the size of the hidden layers, or move to a cloud instance with enough GPU memory (or even CPU-only execution) so that the static memory allocation works.
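For example, here is a minimal sketch of the batch-size option, assuming a plain TF 1.x feed_dict-style training loop; `train_op`, `x_ph`, `y_ph`, `num_steps`, and `next_batch` are hypothetical placeholders for whatever the real graph and input pipeline define:

import tensorflow as tf

# Hypothetical names: train_op, x_ph, y_ph, num_steps, and next_batch stand in
# for whatever the real graph and input pipeline define.
batch_size = 4  # try halving this until the first training step fits in GPU memory

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        x_batch, y_batch = next_batch(batch_size)  # smaller batches -> smaller activation tensors
        sess.run(train_op, feed_dict={x_ph: x_batch, y_ph: y_batch})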

For the second situation, you are likely running into a memory leak of sorts. Many training implementations use a callback on a hold-out dataset to compute a validation score. This execution, for example when invoked by Keras, may hold on to GPU session resources. If those are not released, they build up and can cause the GPU to report OOM after several epochs. Others have suggested using a second GPU instance for the validation session, but I think a better approach is smarter session handling in the validation callback, specifically releasing GPU session resources when each validation callback completes.

Here is pseudocode illustrating the callback problem. This callback leads to OOM:

my_models_validation_score = tf.get_some_v_score  # placeholder op; nothing ever releases the session resources behind it

This callback does not lead to OOM:

with tf.Session() as sess:
    sess.run(get_some_v_score)  # the session is closed when the with-block exits, releasing its GPU resources
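Expanding that second snippet into something closer to real code, here is a sketch of a validation callback that opens and closes its own session each time it runs, so the GPU resources it grabs are released at the end of every epoch. The names `validation_score_op`, `x_ph`, `y_ph`, `x_holdout`, and `y_holdout` are hypothetical placeholders, not part of the asker's graph:

import tensorflow as tf

def validation_callback(validation_score_op, x_ph, y_ph, x_holdout, y_holdout):
    # The session exists only inside this with-block; when the block exits,
    # the session is closed, so its GPU resources are freed instead of
    # accumulating epoch after epoch.
    with tf.Session() as sess:
        score = sess.run(validation_score_op,
                         feed_dict={x_ph: x_holdout, y_ph: y_holdout})
    return score

(In a real training script you would also need to make the trained variable values visible to this fresh session, for example by restoring from a checkpoint; that part is omitted from the sketch.)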

I invite others to help add to this response...

answered Oct 20 '22 by Rick Lentz