How to utilize 100% of GPU memory with Tensorflow?

Tags:

python

tensorflow

I have a 32 GB graphics card, and when my script starts I see:

2019-07-11 01:26:19.985367: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 95.16G (102174818304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.988090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 85.64G (91957338112 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.990806: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 77.08G (82761605120 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.993527: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 69.37G (74485440512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.996219: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 62.43G (67036893184 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.998911: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 56.19G (60333203456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.001601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 50.57G (54299881472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.004296: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 45.51G (48869892096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.006981: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 40.96G (43982901248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.009660: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 36.87G (39584608256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.012341: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 33.18G (35626147840 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

After that, TF settles on using 96% of my memory. Later, when it runs out of memory, it tries to allocate 65G:

tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 65.30G (70111285248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
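One way to read the startup log is to compare each failed request with the one before it. The sketch below uses only the byte counts copied from the log lines above; the roughly 0.9 ratio it prints is an observation from those numbers, not a statement about TensorFlow internals confirmed by this thread.

```python
# Byte counts copied from the failed-allocation log lines at startup.
failed_requests = [
    102174818304, 91957338112, 82761605120, 74485440512, 67036893184,
    60333203456, 54299881472, 48869892096, 43982901248, 39584608256,
    35626147840,
]

# Ratio of each request to the one immediately before it.
ratios = [
    later / earlier
    for earlier, later in zip(failed_requests, failed_requests[1:])
]

for size, ratio in zip(failed_requests[1:], ratios):
    # The log prints sizes in units of 2**30 bytes.
    print(f"{size / 2**30:6.2f}G  ratio to previous request: {ratio:.4f}")
```

Every ratio comes out close to 0.9, so each retry asks for about 10% less than the previous attempt.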

My question is: what about the remaining 1300 MB (0.04 * 32480)? I would not mind using that memory before running out of memory.

How can I make TF utilize 99.9% of memory instead of 96%?

Update: nvidia-smi output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:16.0 Off |                    0 |
| N/A   66C    P0   293W / 300W |  31274MiB / 32480MiB |    100%      Default |

I am asking about the 1206 MiB (32480 MiB - 31274 MiB) that remain unused. Maybe they are there for a reason, or maybe they are used just before OOM.
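For reference, the gap can be worked out directly from the two figures in the nvidia-smi row above. This is plain arithmetic on the reported numbers; the variable names are only illustrative.

```python
total_mib = 32480  # total memory reported by nvidia-smi
used_mib = 31274   # memory in use reported by nvidia-smi

unused_mib = total_mib - used_mib
used_fraction = used_mib / total_mib

print(f"unused: {unused_mib} MiB")
print(f"used:   {used_fraction:.1%}")
```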


asked Jul 11 '19 17:07

1 Answer

Monitoring a GPU is not as simple as monitoring a CPU. Many processes run in parallel, and any of them could create a bottleneck for your GPU.

There could be various problems, such as:
1. Read/write speed of your data
2. The CPU or the disk causing a bottleneck

But I think it is pretty normal to use 96%. Also, nvidia-smi only shows the state at one specific instant.

You can install gpustat and use it to monitor the GPU live (you should be hitting 100% during OOM):

pip install gpustat

gpustat -i
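If you would rather log the numbers from inside a script, a minimal sketch is to call nvidia-smi with its CSV query flags and parse the result. The function names `parse_memory_csv` and `query_gpu_memory` are hypothetical helpers written for this example, and `query_gpu_memory` assumes nvidia-smi is on the PATH.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one GPU per line."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use (needs an NVIDIA driver)."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_memory_csv(output)
```

Calling `query_gpu_memory()` in a loop with a short sleep gives a rough trace of memory use over time, which helps check whether usage climbs right before the OOM.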

What can you do?
1. Use a data_iterator to process the data in parallel, which is faster.
2. Increase the batch size. (I don't think this will work in your case, as you are hitting OOM.)
3. Overclock the GPU (not recommended).
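One more knob that may be worth trying is the per-process memory fraction in the session config. The sketch below is an assumption-laden example, not a confirmed fix: `fraction_for_headroom` and `make_session_config` are hypothetical helper names, the 100 MiB headroom is an arbitrary illustration, and part of the unused memory may be held by the CUDA context itself, in which case no TensorFlow setting will reclaim it.

```python
def fraction_for_headroom(total_mib, headroom_mib):
    """Return the memory fraction that leaves headroom_mib MiB unallocated."""
    if not 0 <= headroom_mib < total_mib:
        raise ValueError("headroom must be between 0 and the total memory")
    return (total_mib - headroom_mib) / total_mib


def make_session_config(memory_fraction):
    """Build a session config with the given GPU memory fraction."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow as tf

    gpu_options = tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=memory_fraction
    )
    return tf.compat.v1.ConfigProto(gpu_options=gpu_options)


# Example: leave an (arbitrary) 100 MiB free on a 32480 MiB card.
fraction = fraction_for_headroom(32480, 100)
print(f"per_process_gpu_memory_fraction = {fraction:.4f}")
```

The resulting config would be passed as `tf.compat.v1.Session(config=make_session_config(fraction))`. Whether this actually raises usage above 96% on your card is something to verify with nvidia-smi or gpustat.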

Here is a nice article on hardware acceleration.


answered Sep 20 '22 09:09

ASHu2

Question asked by y.selivonchyk