(GTX 1080, TensorFlow 1.0.0)
During training, the nvidia-smi output (below) suggests that GPU utilization is 0% most of the time, even though the GPU is in use. Judging by how long the training has already been running, that does seem to be the case. Once in a while utilization spikes to around 100%, but only for about a second.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 33%   35C    P2    49W / 190W |   7982MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1093    G   /usr/lib/xorg/Xorg                             175MiB |
|    0      1915    G   compiz                                          90MiB |
|    0      4383    C   python                                        7712MiB |
+-----------------------------------------------------------------------------+
I am running into the situation described in this issue. The problem can be reproduced either with the code from that GitHub repository or by following this simple retraining example from TensorFlow's website and passing a restricted per_process_gpu_memory_fraction (less than 1.0) in the session, like this:
import tensorflow as tf

# Limit this process to ~40% of the GPU's memory.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Question 1: How can I actually utilize the GPU during training while using less than 1.0 of the GPU's memory?
Question 2: How can I use the full GPU (without setting the fraction below 1.0) with my graphics card?
Help and hints appreciated!
When you create a graph that is bigger than the GPU's memory, TensorFlow falls back to the CPU, so it uses RAM and the CPU instead of the GPU. So just remove the per_process_gpu_memory_fraction option and reduce the batch size. Most probably the example uses a large batch size because it was written for more than one GPU or for a CPU with more than 32 GB of RAM, which is not your case. It can also be the optimizer algorithm you chose: SGD uses less memory than other algorithms, so try it first. With 8 GB on the GPU you can try a batch size of 16 with SGD; it should work. Then you can increase the batch size or switch to other algorithms like RMSProp.
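For illustration, here is a minimal sketch of that suggestion in the TF 1.x API. The tiny placeholder model, the input shapes, and the batch size of 16 are assumptions for the example, not the retraining script's actual code:

import tensorflow as tf

# Stand-in graph -- the real retraining script defines its own model.
batch_size = 16                       # small batch so activations fit in 8 GB
images = tf.placeholder(tf.float32, [batch_size, 784])
labels = tf.placeholder(tf.int64, [batch_size])

weights = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1))
biases = tf.Variable(tf.zeros([10]))
logits = tf.matmul(images, weights) + biases

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Plain SGD uses less memory than RMSProp or Adam.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# No per_process_gpu_memory_fraction: let TensorFlow claim the whole GPU.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Feed 16-example batches here, e.g.
    # sess.run(train_op, feed_dict={images: batch_x, labels: batch_y})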
If it is still not working, you are probably doing something else, such as saving a checkpoint on every iteration. Saving a checkpoint is done on the CPU and probably takes much more time than a single iteration on the GPU. That could be why you only see spikes in GPU usage.
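For example (a sketch, not your actual training loop; the step count and save interval are made up), save a checkpoint every N steps instead of every step:

import tensorflow as tf

saver = tf.train.Saver()              # assumes a graph with variables is already built
save_every = 1000                     # checkpoint every N steps, not every step

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100000):
        # sess.run(train_op, feed_dict=...)   # the actual training step goes here
        if step % save_every == 0:
            # Saving runs on the CPU and is slow; doing it rarely keeps the GPU busy.
            saver.save(sess, '/tmp/model.ckpt', global_step=step)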