What's the best way to measure detailed GPU memory usage in TensorFlow?

I'm trying to use the TensorFlow profiler to measure detailed GPU memory usage, such as conv1 activations, weights, etc. The profiler reported a peak usage of 4000 MB, but at the same time nvidia-smi reported 10000 MB of usage. That's a big difference and I don't know the root cause. Can anyone give some suggestions on how to proceed?

TF profile:

[screenshot: TF profiler memory report]

nvidia-smi:

[screenshot: nvidia-smi output]

TensorFlow version: 1.9.0


People also ask

How do I check graphics card memory in Python?

You will need to install the nvidia-ml-py3 library (pip install nvidia-ml-py3), which provides Python bindings to the NVIDIA Management Library. A minimal snippet is sketched below.
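A rough sketch of such a snippet (assuming the nvidia-ml-py3 package, which exposes the pynvml module, plus at least one NVIDIA GPU and a working driver):

    # Hedged sketch: query per-device memory via NVML using the pynvml bindings
    # installed by nvidia-ml-py3.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)    # total/used/free, in bytes
    print("total: %.0f MiB" % (info.total / 1024**2))
    print("used:  %.0f MiB" % (info.used / 1024**2))
    print("free:  %.0f MiB" % (info.free / 1024**2))
    pynvml.nvmlShutdown()

Note that this reports the same device-level "used" figure that nvidia-smi shows, i.e. memory reserved by processes, not what TensorFlow is actually using internally.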

How do I limit GPU memory usage in TensorFlow?

Limiting GPU memory growth: to limit TensorFlow to a specific set of GPUs, use the tf.config.set_visible_devices method. In some cases it is desirable for the process to allocate only a subset of the available memory, or to grow the memory usage only as needed by the process (see the sketch below).
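A short sketch of the tf.config calls mentioned above (these are TF 2.x APIs, roughly 2.1+, and do not exist in the 1.9.0 release used in the question):

    # Sketch for TF 2.x: restrict TF to one GPU and let its allocation grow on
    # demand instead of reserving (almost) all device memory at startup.
    # This must run before any GPU has been initialized.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[0], 'GPU')            # use only the first GPU
        tf.config.experimental.set_memory_growth(gpus[0], True)  # allocate as needed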


1 Answer

First, TF will by default allocate most, if not all, of the available GPU memory when it starts. This actually lets TF use memory more efficiently. To change this behavior you can set the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true (export TF_FORCE_GPU_ALLOW_GROWTH=true). More options for controlling allocation are available in TensorFlow's GPU documentation.
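For the TF 1.x line used in the question (1.9.0), the same behavior can be requested through the session config; a minimal sketch:

    # Minimal TF 1.x sketch: let the GPU allocation grow on demand rather than
    # letting TF reserve (nearly) all GPU memory when the session starts.
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # Or cap TF at a fixed fraction of device memory instead:
    # config.gpu_options.per_process_gpu_memory_fraction = 0.5

    with tf.Session(config=config) as sess:
        pass  # build and run your graph here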

Once you've done that, nvidia-smi will still report exaggerated memory usage numbers, because nvidia-smi reports the memory allocated by the process, while the profiler reports the actual peak memory in use.

TF uses BFC (best-fit with coalescing) as its memory allocator. Whenever TF runs out of, say, 4 GB of memory, it allocates twice that amount, 8 GB; the next time it would try to allocate 16 GB. At the same time, the program might only use 9 GB at peak, but the 16 GB allocation is what nvidia-smi reports. Also, BFC is not the only thing that allocates GPU memory in TensorFlow, so it can actually use 9 GB plus something.

Another comment: TensorFlow's native tools for reporting memory usage have not been particularly precise in the past, so I would go as far as to say that the profiler might actually be somewhat underestimating peak memory usage.
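One way to cross-check the profiler is to ask TF's allocator directly. A rough sketch using tf.contrib.memory_stats, which ships in the 1.x contrib namespace (the matmul below is just a stand-in workload):

    # Rough TF 1.x sketch: compare the allocator's own in-use/peak byte counts
    # with what the profiler and nvidia-smi report. The matmul is a placeholder.
    import tensorflow as tf

    with tf.device('/gpu:0'):
        a = tf.random_normal([4096, 4096])
        b = tf.random_normal([4096, 4096])
        c = tf.matmul(a, b)
        in_use = tf.contrib.memory_stats.BytesInUse()    # bytes currently in use
        peak = tf.contrib.memory_stats.MaxBytesInUse()   # peak bytes in use so far
        limit = tf.contrib.memory_stats.BytesLimit()     # most the allocator may hand out

    with tf.Session() as sess:
        sess.run(c)
        use_b, peak_b, limit_b = sess.run([in_use, peak, limit])
        print("in use: %.0f MiB, peak: %.0f MiB, limit: %.0f MiB"
              % (use_b / 1024**2, peak_b / 1024**2, limit_b / 1024**2))

The peak number should line up roughly with what the profiler reports, while nvidia-smi will show the larger region BFC has reserved.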

Here is some info on memory management: https://github.com/miglopst/cs263_spring2018/wiki/Memory-management-for-tensorflow

Another, somewhat more advanced, link for checking memory usage: https://github.com/yaroslavvb/memory_util
