For a while I have been noticing that TensorFlow (v0.8) does not seem to fully use the computational power of my Titan X. For several CNNs that I have been running, the GPU utilization does not seem to exceed ~30%; typically it is even lower, more like 15%. One particular example of a CNN that shows this behavior is the CNN from DeepMind's Atari paper with Q-learning (see link below for code).
When I see other people in our lab running CNNs written in Theano or Torch, the GPU utilization is typically 80%+. This makes me wonder: why are the CNNs that I write in TensorFlow so 'slow', and what can I do to make more efficient use of the GPU's processing power? Generally, I am interested in ways to profile the GPU operations and discover where the bottlenecks are. Any recommendations on how to do this are very welcome, since this does not really seem possible with TensorFlow at the moment.
Things I did to find out more about the cause of this problem:
Analyzed TensorFlow's device placement; everything seems to be on /gpu:0, so that looks OK (a minimal check is sketched below).
Using cProfile, I have optimized the batch generation and other preprocessing steps. The preprocessing is performed on a single thread, but the actual optimization steps performed by TensorFlow take much longer (see average runtimes below). One obvious idea to increase the speed is to use TF's queue runners (a sketch follows the timings below), but since the batch preparation is already 20x faster than the optimization step I wonder whether this will make a big difference.
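For completeness, this is roughly the kind of minimal check I used for placement (a toy graph, not my actual Q-network); with a GPU build, the ops should be reported on /gpu:0:

```python
import tensorflow as tf

# Trivial graph just to inspect placement; the real model is much larger.
a = tf.constant([1.0, 2.0], name='a')
b = tf.constant([3.0, 4.0], name='b')
c = a + b

# log_device_placement=True prints the device chosen for every op to stderr.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))
```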
Avg. Time Batch Preparation: 0.001 seconds
Avg. Time Train Operation: 0.021 seconds
Avg. Time Total per Batch: 0.022 seconds (45.18 batches/second)
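In case it is relevant, this is roughly what I had in mind for the queue-based input; it is only a sketch with made-up shapes and random data standing in for the real batch preparation, not my actual code:

```python
import threading
import numpy as np
import tensorflow as tf

BATCH_SIZE = 32
STATE_SHAPE = [84, 84, 4]  # made-up Atari-style state shape

state_ph = tf.placeholder(tf.float32, [None] + STATE_SHAPE)
target_ph = tf.placeholder(tf.float32, [None])

# A FIFO queue lets batch preparation (CPU) run ahead of the train op (GPU),
# so the GPU does not sit idle waiting on feed_dict between steps.
queue = tf.FIFOQueue(capacity=4 * BATCH_SIZE,
                     dtypes=[tf.float32, tf.float32],
                     shapes=[STATE_SHAPE, []])
enqueue_op = queue.enqueue_many([state_ph, target_ph])
batch_states, batch_targets = queue.dequeue_many(BATCH_SIZE)

def fill_queue(sess, coord):
    """Background thread that keeps the queue topped up with prepared batches."""
    while not coord.should_stop():
        states = np.random.rand(BATCH_SIZE, *STATE_SHAPE).astype(np.float32)  # stand-in for real prep
        targets = np.random.rand(BATCH_SIZE).astype(np.float32)
        sess.run(enqueue_op, feed_dict={state_ph: states, target_ph: targets})

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    feeder = threading.Thread(target=fill_queue, args=(sess, coord))
    feeder.daemon = True  # in real code, close the queue cleanly instead
    feeder.start()
    # The training op would consume batch_states / batch_targets directly;
    # here we just pull one batch to show the wiring works.
    print(sess.run(tf.reduce_mean(batch_states)))
    coord.request_stop()
```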
Ran the code on multiple machines to rule out hardware issues.
Upgraded to the latest versions of cuDNN v5 (RC) and CUDA Toolkit 7.5, and reinstalled TensorFlow from source about a week ago.
An example of the Q-learning CNN for which this 'problem' occurs can be found here: https://github.com/tomrunia/DeepReinforcementLearning-Atari/blob/master/qnetwork.py
Example of nvidia-smi output showing the low GPU utilization: [NVIDIA-SMI screenshot]
If you're getting less than 80-90% GPU usage in demanding games, you most likely have a CPU bottleneck. The CPU has to feed data to the GPU. Your GPU has nothing to work on if the CPU can't send enough data. This problem shows up when you pair a powerful graphics card with a low-end CPU.
TensorFlow runs up to 50% faster on the latest Pascal GPUs and scales well across GPUs. Now you can train the models in hours instead of days.
With more recent versions of TensorFlow (I am using TensorFlow 1.4), we can obtain runtime statistics and visualize them in TensorBoard.
These statistics include compute time and memory usage for each node in the computation graph.
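A minimal sketch of how to collect them (the toy conv layer below is only a stand-in for the real model): pass a RunOptions with full tracing plus a RunMetadata object to session.run, attach the metadata to a TensorBoard summary writer, and optionally export a Chrome trace as well:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph standing in for the real network.
x = tf.random_normal([64, 84, 84, 4])
conv = tf.layers.conv2d(x, filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)
loss = tf.reduce_mean(conv)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('logs', sess.graph)

    # Trace a single training step: records per-node compute time and memory.
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    writer.add_run_metadata(run_metadata, 'step_0')

    # Optionally export a Chrome trace (open chrome://tracing) with op-level timings.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())
    writer.close()
```

In TensorBoard's Graphs tab you can then select the tagged run and color the nodes by compute time or memory to see where the time goes.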