Google Colaboratory `ResourceExhaustedError` with GPU

I'm trying to fine-tune a VGG16 model using Colaboratory, but I ran into this error when training with the GPU.

OOM when allocating tensor of shape [7,7,512,4096]

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor of shape [7,7,512,4096] and type float
     [[Node: vgg_16/fc6/weights/Momentum/Initializer/zeros = Const[_class=["loc:@vgg_16/fc6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,512,4096] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'vgg_16/fc6/weights/Momentum/Initializer/zeros', defined at:

I also have this output for my VM session:

--- colab vm info ---
python v=3.6.3
tensorflow v=1.4.1
tf device=/device:GPU:0
model name  : Intel(R) Xeon(R) CPU @ 2.20GHz
model name  : Intel(R) Xeon(R) CPU @ 2.20GHz
MemTotal:       13341960 kB
MemFree:         1541740 kB
MemAvailable:   10035212 kB

My TFRecord file holds just 118 256x256 JPGs, with file size < 2 MB.

Is there a workaround? It works when I use the CPU, just not the GPU.

michael asked Jan 29 '18 05:01


2 Answers

Seeing a small amount of free GPU memory almost always indicates that you've created a TensorFlow session without the allow_growth = True option. See: https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth

If you don't set this option, TensorFlow by default reserves nearly all GPU memory when a session is created.
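
For reference, in the TensorFlow 1.x API used in the question (1.4.1), the option is set on the session config. A minimal sketch; the helper name `make_growth_session` is made up for illustration:

```python
def make_growth_session():
    """Return a TF 1.x session that allocates GPU memory on demand."""
    # Imported lazily so the helper can be defined before TensorFlow is loaded.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Start small and grow the GPU allocation as needed, instead of
    # reserving nearly all GPU memory up front.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
```

If a training helper creates the session for you, the same `config` object has to be passed through whatever session-config parameter that helper exposes.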

Good news: As of this week, Colab sets this option by default, so you should see much lower memory growth as you use multiple notebooks on Colab. You can also inspect GPU memory usage per notebook by selecting 'Manage sessions' from the runtime menu.

Once selected, you'll see a dialog that lists all notebooks and the GPU memory each is consuming. To free memory, you can terminate runtimes from this dialog as well.

Bob Smith answered Oct 14 '22 04:10


I ran into the same issue and found that my problem was caused by the code below:

from tensorflow.python.framework.test_util import is_gpu_available as tf
if tf()==True:
  device='/gpu:0'
else:
  device='/cpu:0'

I used the code below to check GPU memory usage and found that usage was 0% before running the code above and 95% after.

# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize    
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]

def printm():
  # Report host RAM and this process's size, then the GPU's memory figures
  process = psutil.Process(os.getpid())
  print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " I Proc size: " + humanize.naturalsize( process.memory_info().rss))
  print('GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB'.format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

Before:

Gen RAM Free: 12.7 GB I Proc size: 139.1 MB

GPU RAM Free: 11438MB | Used: 1MB | Util 0% | Total 11439MB

After:

Gen RAM Free: 12.0 GB I Proc size: 1.0 GB

GPU RAM Free: 564MB | Used: 10875MB | Util 95% | Total 11439MB
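
For a sense of scale, the tensor named in the question's error can be sized by hand. A rough calculation, assuming 4 bytes per element (the log reports DT_FLOAT, i.e. 32-bit floats); the function name is illustrative only:

```python
from functools import reduce
from operator import mul

def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element

fc6_shape = [7, 7, 512, 4096]  # shape from the OOM message
size_mib = tensor_bytes(fc6_shape) / 2**20
print("fc6 weights: {:.0f} MiB".format(size_mib))  # prints "fc6 weights: 392 MiB"
```

The op in the error is the Momentum slot for `vgg_16/fc6/weights`, a second tensor of the same shape, so the weights and their slot together come to about 784 MiB. That is more than the 564MB shown as free in the "After" reading, assuming a similar model.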

Somehow, is_gpu_available() managed to consume most of the GPU memory without releasing it afterwards, so instead I used the code below to detect the GPU status. Problem solved.

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
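try:
  import GPUtil as GPU
  GPUs = GPU.getGPUs()
  # getGPUs() may return an empty list when no GPU is visible, so check it
  device = '/gpu:0' if GPUs else '/cpu:0'
except Exception:
  device = '/cpu:0'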
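
GPUtil gets its numbers by calling nvidia-smi, so the same check can be sketched without the extra package. This is an illustrative sketch, not the answerer's code: the helper names are made up, and it assumes `nvidia-smi` is on the PATH when a GPU is attached.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV rows like '10875, 11439' into (used, total) pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Return one (used, total) pair per GPU, or [] if nvidia-smi is unavailable."""
    try:
        out = subprocess.check_output(QUERY).decode("utf-8")
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(out)

gpus = query_gpu_memory()
device = '/gpu:0' if gpus else '/cpu:0'
```

Because this never creates a TensorFlow session, it should not reserve GPU memory the way the is_gpu_available() call did.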

Jianming Lin answered Oct 14 '22 02:10