"Resource exhausted: OOM when allocating tensor" during Retraining of GPT 2 Model:

I am using the Friends dialogues as a dataset to retrain GPT-2 for a conversational AI, but training fails with an out-of-memory error. I know this issue has come up on StackOverflow before, but I can't figure out how to apply the usual fixes to an NLP task.

I have tried setting the batch size to 50 (my dataset has around 60k lines). I have been following this tutorial on retraining GPT-2 on a custom dataset.

My system specs are:

OS: Windows 10
RAM: 16 GB
CPU: i7 8th Gen
GPU: Nvidia GTX 1050 Ti (4 GB)

This is the entire error message:

Resource exhausted: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/h0/attn/c_attn/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[{{node Mean}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 271, in main
    feed_dict={context: sample_batch()})
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node model/h0/attn/c_attn/MatMul (defined at D:\Python and AI\Generative Chatbot\gpt-2\src\model.py:55) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[node Mean (defined at train.py:96) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'model/h0/attn/c_attn/MatMul', defined at:
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 93, in main
    output = model.model(hparams=hparams, X=context_in)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 164, in model
    h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 126, in block
    a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 102, in attn
    c = conv1d(x, 'c_attn', n_state*3)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 55, in conv1d
    c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2455, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5333, in mat_mul
    name=name)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node model/h0/attn/c_attn/MatMul (defined at D:\Python and AI\Generative Chatbot\gpt-2\src\model.py:55) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[node Mean (defined at train.py:96) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
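
For reference, the hint in the traceback refers to the TensorFlow 1.x RunOptions API. A minimal sketch of how that flag can be attached to a session call is below; the fetch name is a placeholder, and the feed dict mirrors the one shown in the traceback:

import tensorflow as tf

# TensorFlow 1.x: ask the runtime to report live tensor allocations
# when an OOM error occurs, to see what is filling GPU memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Pass the options into the existing run call in train.py; "loss" is a
# placeholder for whatever op is actually fetched there.
# sess.run(loss, feed_dict={context: sample_batch()}, options=run_options)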
asked Jan 26 '23 by Bhavesh Laddagiri


1 Answer

This error almost always means the same thing: the GPU has run out of memory. Try a batch size of 1 and see if that works, then increase it to find out how much your GPU can handle. If it cannot handle even a batch size of 1, the model is probably too big for your GPU. If the error does not appear immediately, check that the code itself is okay; there may be a bug in it. Also check what else is using your GPU, to be sure nothing unnecessary is taking up its resources.
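
To check what else is using the GPU, nvidia-smi lists the processes currently holding GPU memory. A related TensorFlow 1.x setting is allow_growth: it will not shrink the model's footprint, but it keeps the session from reserving nearly all of the card's memory up front, which helps when other processes also need the GPU. A minimal sketch:

import tensorflow as tf

# TensorFlow 1.x: let the GPU allocator grow on demand instead of
# pre-reserving almost the entire 4 GB card at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # build or restore the model and run training here
    pass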

answered Jan 31 '23 by T. Kelher