 

How to utilize 100% of GPU memory with Tensorflow?

I have a 32Gb graphics card and upon start of my script I see:

2019-07-11 01:26:19.985367: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 95.16G (102174818304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.988090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 85.64G (91957338112 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.990806: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 77.08G (82761605120 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.993527: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 69.37G (74485440512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.996219: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 62.43G (67036893184 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.998911: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 56.19G (60333203456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.001601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 50.57G (54299881472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.004296: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 45.51G (48869892096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.006981: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 40.96G (43982901248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.009660: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 36.87G (39584608256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.012341: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 33.18G (35626147840 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
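Interestingly, each retry appears to ask for 90% of the previous request (95.16G, 85.64G, 77.08G, ...). A quick check of that reading (my interpretation of the log, not the actual allocator source):

```python
# Reproduce the shrinking allocation requests seen in the log above,
# assuming each retry asks for 90% of the previous request.
def backoff_requests(first_bytes, n):
    reqs, cur = [], first_bytes
    for _ in range(n):
        reqs.append(cur)
        cur = int(cur * 0.9)
    return reqs

gib = 1 << 30
# First request from the log: 102174818304 bytes (95.16G)
print([round(r / gib, 2) for r in backoff_requests(102174818304, 3)])
# [95.16, 85.64, 77.08] -- matches the first three log lines
```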

After that, TF settles on using 96% of my memory. Later, when it actually runs out of memory, it tries to allocate 65.30G:

tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 65.30G (70111285248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

My question is: what about the remaining ~1300MiB (0.04 × 32480)? I would not mind using those before running OOM.

How can I make TF utilize 99.9% of memory instead of 96%?

Update: nvidia-smi output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:16.0 Off |                    0 |
| N/A   66C    P0   293W / 300W |  31274MiB / 32480MiB |    100%      Default |

I am asking about the 1206MiB (32480MiB − 31274MiB) that remain unused. Maybe they are reserved for a reason; maybe they are used just before OOM.
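To put a number on it, a tiny helper (hypothetical, just parsing the Memory-Usage field shown above):

```python
# Compute the slack from an nvidia-smi Memory-Usage field such as
# "31274MiB / 32480MiB" (used / total).
import re

def memory_slack(mem_field):
    used, total = (int(x) for x in re.findall(r"(\d+)MiB", mem_field))
    return total - used, used / total

slack_mib, frac = memory_slack("31274MiB / 32480MiB")
print(slack_mib, round(frac * 100, 1))  # 1206 96.3
```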

y.selivonchyk asked Jul 11 '19




1 Answer

Monitoring a GPU is not as simple as monitoring a CPU. Many parallel processes are going on, and any of them can create a bottleneck for your GPU.

There could be various problems, such as:
1. Read/write speed of your data
2. Either the CPU or the disk causing a bottleneck

But I think it is pretty normal to use 96%. Also keep in mind that nvidia-smi only shows usage at one specific instant.
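If you still want to push past the default, the only knob I know of is per_process_gpu_memory_fraction (a TF 1.x sketch; TF keeps a safety margin by default, and asking for the full 1.0 can itself fail to allocate on some setups):

```python
import tensorflow as tf

config = tf.ConfigProto()
# Ask the allocator for the full device memory instead of the default
# fraction. Whether 1.0 is actually granted depends on what CUDA and
# the driver have already reserved on the card.
config.gpu_options.per_process_gpu_memory_fraction = 1.0
sess = tf.Session(config=config)
```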

You can install gpustat and use it to monitor the GPU live (you should be hitting 100% during OOM):

pip install gpustat

gpustat -i

What can you do?
1. Use a data iterator to feed the data in parallel, faster.
2. Increase the batch size. (I don't think this will help in your case, since you are already hitting OOM.)
3. Overclock the GPU (not recommended).
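To illustrate point 1: the idea of a data iterator is to overlap data loading with GPU compute, so loading never stalls training. A toy sketch in plain Python (the prefetch helper is a hypothetical stand-in for what tf.data's prefetching does):

```python
# Toy producer/consumer: a background thread loads batches into a
# bounded queue while the consumer (the "GPU") drains them, so
# loading no longer blocks the training step.
import queue
import threading

def prefetch(batches, buffer_size=2):
    q = queue.Queue(maxsize=buffer_size)
    DONE = object()  # sentinel marking the end of the stream

    def producer():
        for b in batches:
            q.put(b)
        q.put(DONE)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not DONE:
        yield item

print(list(prefetch(range(5))))  # [0, 1, 2, 3, 4]
```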

Here is a nice article on hardware acceleration.

ASHu2 answered Sep 20 '22