"Resource exhausted: OOM when allocating tensor" during Retraining of GPT 2 Model:

I am using the Friends dialogues as a dataset to retrain GPT-2 for a conversational AI, but training fails with an out-of-memory error. I know this issue has come up on StackOverflow before, but I can't figure out how to apply the usual fixes to an NLP task.

I have tried setting the batch size to 50 (my dataset has around 60k lines). I have been following this tutorial on retraining GPT-2 on a custom dataset.

My system specs are:

OS: Windows 10
RAM: 16 GB
CPU: i7 8th Gen
GPU: Nvidia GTX 1050 Ti (4 GB)

This is the entire error message:

Resource exhausted: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/h0/attn/c_attn/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[{{node Mean}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 271, in main
    feed_dict={context: sample_batch()})
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node model/h0/attn/c_attn/MatMul (defined at D:\Python and AI\Generative Chatbot\gpt-2\src\model.py:55) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[node Mean (defined at train.py:96) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'model/h0/attn/c_attn/MatMul', defined at:
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 93, in main
    output = model.model(hparams=hparams, X=context_in)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 164, in model
    h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 126, in block
    a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 102, in attn
    c = conv1d(x, 'c_attn', n_state*3)
  File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 55, in conv1d
    c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2455, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5333, in mat_mul
    name=name)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node model/h0/attn/c_attn/MatMul (defined at D:\Python and AI\Generative Chatbot\gpt-2\src\model.py:55) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[node Mean (defined at train.py:96) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
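
For reference, the hint in the traceback refers to the TensorFlow 1.x RunOptions API. A minimal sketch of how that flag can be attached to a session call is below; the fetch name is a placeholder, and the feed dict mirrors the one shown in the traceback:

import tensorflow as tf

# TensorFlow 1.x: ask the runtime to report live tensor allocations
# when an OOM error occurs, to see what is filling GPU memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Pass the options into the existing run call in train.py; "loss" is a
# placeholder for whatever op is actually fetched there.
# sess.run(loss, feed_dict={context: sample_batch()}, options=run_options)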
asked Jan 26 '23 by Bhavesh Laddagiri


1 Answer

This error almost always means the same thing: the GPU has run out of memory. Try a batch size of 1 and see if that works, then increase it to find out how much your GPU can handle. If it cannot handle even a batch size of 1, the model is probably too big for your GPU. If the error does not appear immediately, check that the code itself is okay; there may be a bug in it. Also check what else is using your GPU, to be sure nothing unnecessary is taking up its resources.
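
To check what else is using the GPU, nvidia-smi lists the processes currently holding GPU memory. A related TensorFlow 1.x setting is allow_growth: it will not shrink the model's footprint, but it keeps the session from reserving nearly all of the card's memory up front, which helps when other processes also need the GPU. A minimal sketch:

import tensorflow as tf

# TensorFlow 1.x: let the GPU allocator grow on demand instead of
# pre-reserving almost the entire 4 GB card at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # build or restore the model and run training here
    pass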

answered Jan 31 '23 by T. Kelher