
How to programmatically determine available GPU memory with TensorFlow?

For a vector quantization (k-means) program I would like to know the amount of available memory on the present GPU (if there is one). This is needed to choose an optimal batch size, so that the complete data set can be processed in as few batches as possible.
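For context, the batch-size choice boils down to a back-of-the-envelope calculation along these lines (the numbers here are illustrative only):

free_bytes = 6 * 1024**3           # e.g. free GPU memory, however obtained
bytes_per_sample = 250000 * 4      # one float32 row of the data set below
batch_size = free_bytes // bytes_per_sample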

I have written the following test program:

import tensorflow as tf
import numpy as np
from kmeanstf import KMeansTF  # the k-means package this is for (unused in this test)

print("GPU Available: ", tf.test.is_gpu_available())

nn = 1000
dd = 250000
print("{:,d} bytes".format(nn * dd * 4))  # size of one float32 tensor below

# allocate four ~1 GB tensors on the GPU, printing a tiny slice of each
dic = {}
for x in "ABCD":
    dic[x] = tf.random.normal((nn, dd))
    print(x, dic[x][:1, :2])

print("done...")

This is typical output on my system (Ubuntu 18.04 LTS, GTX 1060 6 GB). Please note the core dump.

python misc/maxmem.py 
GPU Available:  True
1,000,000,000 bytes
A tf.Tensor([[-0.23787294 -2.0841186 ]], shape=(1, 2), dtype=float32)
B tf.Tensor([[ 0.23762687 -1.1229591 ]], shape=(1, 2), dtype=float32)
C tf.Tensor([[-1.2672468   0.92139906]], shape=(1, 2), dtype=float32)
2020-01-02 17:35:05.988473: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000).  Current allocation summary follows.
2020-01-02 17:35:05.988752: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************xx
2020-01-02 17:35:05.988835: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Segmentation fault (core dumped)

Occasionally I get a Python exception instead of a core dump (see below). This would actually be better, since I could catch it and thus determine the maximum available memory by trial and error (a sketch of that approach follows the log below). But it alternates with core dumps:

python misc/maxmem.py 
GPU Available:  True
1,000,000,000 bytes
A tf.Tensor([[-0.73510283 -0.94611156]], shape=(1, 2), dtype=float32)
B tf.Tensor([[-0.8458411  0.552555 ]], shape=(1, 2), dtype=float32)
C tf.Tensor([[0.30532074 0.266423  ]], shape=(1, 2), dtype=float32)
2020-01-02 17:35:26.401156: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000).  Current allocation summary follows.
2020-01-02 17:35:26.401486: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************xx
2020-01-02 17:35:26.401571: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "misc/maxmem.py", line 11, in <module>
    dic[x]=tf.random.normal((nn,dd))
  File "/home/fritzke/miniconda2/envs/tf20b/lib/python3.7/site-packages/tensorflow_core/python/ops/random_ops.py", line 76, in random_normal
    value = math_ops.add(mul, mean_tensor, name=name)
  File "/home/fritzke/miniconda2/envs/tf20b/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 391, in add
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: random_normal/
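A minimal sketch of that trial-and-error idea, assuming the failed allocation surfaces as a catchable ResourceExhaustedError (as the logs above show, the process may segfault instead, so this is not reliable):

import tensorflow as tf

def probe_free_gpu_bytes(start=64 * 1024**2, limit=32 * 1024**3):
    """Double a float32 allocation until it fails; return the largest size that fit.
    Illustrative only: the allocation may crash the process instead of raising."""
    largest_ok = 0
    size = start
    while size <= limit:
        try:
            t = tf.random.normal((size // 4,))  # size bytes of float32
            del t                               # release before the next, larger attempt
            largest_ok = size
            size *= 2
        except tf.errors.ResourceExhaustedError:
            break
    return largest_ok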

How could I reliably get this information for whatever system the software is running on?

asked Jan 02 '20 by Barden




3 Answers

I actually found an answer in this old question of mine. To bring some additional benefit to readers, I tested the program mentioned there

import nvidia_smi  # pip install nvidia-ml-py3

nvidia_smi.nvmlInit()

# card id 0 hardcoded here; there is also a call to get all available card ids, so we could iterate
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

print("Total memory:", info.total)
print("Free memory:", info.free)
print("Used memory:", info.used)

nvidia_smi.nvmlShutdown()

on Colab, with the following result:

Total memory: 17071734784
Free memory: 17071734784
Used memory: 0

The actual GPU I had there was a Tesla P100, as can be seen by executing

!nvidia-smi

and observing the output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
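As the comment in the snippet above notes, there is also a call to enumerate all cards; a minimal sketch using nvmlDeviceGetCount (same nvidia-ml-py3 bindings, untested here):

import nvidia_smi

nvidia_smi.nvmlInit()
for i in range(nvidia_smi.nvmlDeviceGetCount()):
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    print("GPU {}: {:,d} of {:,d} bytes free".format(i, info.free, info.total))
nvidia_smi.nvmlShutdown()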
answered Sep 19 '22 by Barden


This code will return the free GPU memory in MiB for each GPU:

import subprocess as sp

def get_gpu_memory():
    command = "nvidia-smi --query-gpu=memory.free --format=csv"
    # drop the trailing empty line and the "memory.free [MiB]" header row
    memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
    memory_free_values = [int(x.split()[0]) for x in memory_free_info]
    return memory_free_values

get_gpu_memory()

This answer relies on nvidia-smi being installed (which is almost always the case for NVIDIA GPUs) and is therefore limited to NVIDIA GPUs.
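A variant of the same idea (a sketch, untested here) that also reports total memory and skips the header parsing via nvidia-smi's noheader/nounits format options:

import subprocess as sp

def get_gpu_memory_info():
    # returns one (total_mib, free_mib) tuple per GPU
    command = "nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader,nounits"
    lines = sp.check_output(command.split()).decode('ascii').strip().split('\n')
    return [tuple(int(v) for v in line.split(',')) for line in lines]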

answered Sep 22 '22 by y.selivonchyk


If you're using tensorflow-gpu==2.5, you can use

tf.config.experimental.get_memory_info('GPU:0')

to get the GPU memory actually consumed by TF. nvidia-smi alone tells you nothing here, because TF pre-allocates GPU memory for itself and leaves nvidia-smi no way to track how much of that pre-allocated pool is actually in use.
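A minimal sketch (TF >= 2.5; 'GPU:0' assumes at least one visible GPU):

import tensorflow as tf

# get_memory_info returns a dict with 'current' and 'peak' byte counts
info = tf.config.experimental.get_memory_info('GPU:0')
print("current bytes in use:", info['current'])
print("peak bytes in use:", info['peak'])

If you need nvidia-smi to reflect real usage instead, tf.config.experimental.set_memory_growth can disable the up-front pre-allocation.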

answered Sep 18 '22 by Captain Trojan