
TensorFlow - Low GPU usage on Titan X

For a while I have been noticing that TensorFlow (v0.8) does not seem to fully use the compute power of my Titan X. For several CNNs that I have been running, the GPU usage does not seem to exceed ~30%, and typically the utilization is even lower, more like 15%. One particular example that shows this behavior is the CNN from DeepMind's Atari paper with Q-learning (see link below for code).

When I see other people in our lab running CNNs written in Theano or Torch, the GPU usage is typically 80%+. This makes me wonder: why are the CNNs that I write in TensorFlow so 'slow', and what can I do to make more efficient use of the GPU processing power? In general, I am interested in ways to profile the GPU operations and discover where the bottlenecks are. Any recommendations on how to do this are very welcome, since this does not really seem possible with TensorFlow at the moment.

Things I did to find out more about the cause of this problem:

  1. Analyzed TensorFlow's device placement; everything seems to be on /gpu:0, so that looks OK (see the first sketch after this list for how to log the placement).

  2. Using cProfile, I have optimized the batch generation and other preprocessing steps. The preprocessing is performed on a single thread, but the actual optimization steps performed by TensorFlow take much longer (see average runtimes below). One obvious idea to increase the speed is to use TensorFlow's queue runners (see the second sketch after this list), but since the batch preparation is already ~20x faster than the optimization step, I wonder whether this will make a big difference.

    Avg. Time Batch Preparation: 0.001 seconds
    Avg. Time Train Operation:   0.021 seconds
    Avg. Time Total per Batch:   0.022 seconds (45.18 batches/second)
    
  3. Ran the code on multiple machines to rule out hardware issues.

  4. Upgraded to the latest versions of cuDNN v5 (RC) and CUDA Toolkit 7.5, and reinstalled TensorFlow from source about a week ago.
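
For reference, here is a minimal sketch of how the device placement from item 1 can be logged. This assumes the standard session API; the tiny graph below is a stand-in, not the actual Q-network:

    import tensorflow as tf

    # Stand-in graph; substitute the actual CNN ops here.
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name='a')
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]], name='b')
    c = tf.matmul(a, b, name='matmul_c')

    # log_device_placement=True prints the device chosen for every op,
    # which should show /gpu:0 for the compute-heavy nodes.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(c))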
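
And a rough sketch of the queue-based prefetching idea from item 2, assuming a FIFOQueue-style input pipeline; `state_shape` and the random batch are placeholders for the real preprocessing code:

    import threading
    import numpy as np
    import tensorflow as tf

    batch_size = 32
    state_shape = [84, 84, 4]  # assumed Atari-style input; adjust to your network

    # FIFO queue holding ready-made batches so the train op never waits on Python.
    queue = tf.FIFOQueue(capacity=16, dtypes=[tf.float32],
                         shapes=[[batch_size] + state_shape])
    batch_ph = tf.placeholder(tf.float32, [batch_size] + state_shape)
    enqueue_op = queue.enqueue(batch_ph)
    next_batch = queue.dequeue()  # feed this tensor into the network instead of feed_dict

    def fill_queue(sess, coord):
        """Background thread: prepare batches on the CPU and push them into the queue."""
        while not coord.should_stop():
            batch = np.random.rand(batch_size, *state_shape).astype(np.float32)  # stand-in data
            sess.run(enqueue_op, feed_dict={batch_ph: batch})

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        t = threading.Thread(target=fill_queue, args=(sess, coord))
        t.daemon = True
        t.start()
        # ... build the train op on top of `next_batch` and run it in a loop ...
        coord.request_stop()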

An example of the Q-learning CNN for which this 'problem' occurs can be found here: https://github.com/tomrunia/DeepReinforcementLearning-Atari/blob/master/qnetwork.py

Example of nvidia-smi output displaying the low GPU utilization: NVIDIA-SMI (screenshot)

asked May 30 '16 by verified.human




1 Answer

With more recent versions of TensorFlow (I am using TensorFlow 1.4), we can obtain runtime statistics and visualize them in TensorBoard.

These statistics include the compute time and memory usage of each node in the computation graph.
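
As a rough sketch of how to collect these statistics (the tiny model below is a stand-in; in practice you would run your own train op with the same `options` and `run_metadata` arguments):

    import numpy as np
    import tensorflow as tf

    # Tiny stand-in model; replace with the actual CNN and its train op.
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    pred = tf.layers.dense(x, 1)
    loss = tf.reduce_mean(tf.square(pred - y))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    # Ask TensorFlow to trace the execution of one session.run() call.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        writer = tf.summary.FileWriter('./logs', sess.graph)
        feed = {x: np.random.rand(32, 10), y: np.random.rand(32, 1)}
        sess.run([train_op, loss], feed_dict=feed,
                 options=run_options, run_metadata=run_metadata)
        # Attach per-node compute time and memory usage to this step;
        # TensorBoard's Graph tab can then color nodes by these statistics.
        writer.add_run_metadata(run_metadata, 'step_0')
        writer.close()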

answered Oct 19 '22 by Sunreef