 

TensorFlow: Graph Optimization (GPU vs CPU Performance)

This issue was originally posted on GitHub as #3320. It is worth starting there, as the thread has more detail on the original problem, and it is bulky enough that I don't want to re-post it all on StackOverflow. In summary: performance is slower when using the GPU than the CPU to process the TensorFlow graph. CPU/GPU timelines (from debugging) are included for evaluation. One of the comments was about optimizing the graph to speed up processing, with a request for a toy example to discuss. The "Original Solution" is my reinforcement learning code that showed the slow performance, and I created a few published scripts for community discussion and evaluation.

I have enclosed the test scripts as well as some of the raw data, trace files, and TensorBoard log files to speed up any review: CPUvsGPU testing.zip
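For reference, trace files like the attached ones can be captured with TensorFlow's timeline tooling; this is a minimal sketch of the standard mechanism, not necessarily the exact code used in the zip:

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Toy graph; the attached traces come from the real scripts instead.
    x = tf.random_normal([1000, 1000])
    y = tf.matmul(x, x)

    with tf.Session() as sess:
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        sess.run(y, options=run_options, run_metadata=run_metadata)

        # Convert the collected step stats into Chrome's trace format;
        # the resulting JSON can be inspected at chrome://tracing.
        trace = timeline.Timeline(step_stats=run_metadata.step_stats)
        with open('timeline.json', 'w') as f:
            f.write(trace.generate_chrome_trace_format())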

The discussion was moved to StackOverflow because this topic would benefit all TensorFlow users. What I am hoping to discover are ways to optimize the performance of the published graph. The GPU-vs-CPU issue can be separated out, as it might be solved by a more efficient TensorFlow graph.

What I did was take my Original Solution and strip out the "Game Environment", replacing it with random data generation (a rough sketch of the substitution is below). In this game environment there is no creation or modification of the TensorFlow graph. The structure closely follows/leverages nivwusquorum's GitHub Reinforcement Learning Example.
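As an illustration of that substitution (the names here are mine, not from cpuvsgpu.py), the stand-in environment just hands random NumPy arrays to the training loop and never touches the graph:

    import numpy as np

    # Hypothetical shapes; the published script defines its own.
    OBS_DIM = 32
    BATCH_SIZE = 64

    def random_batch():
        """Stand-in for the stripped-out Game Environment: random data
        instead of real game observations and rewards."""
        observations = np.random.randn(BATCH_SIZE, OBS_DIM).astype(np.float32)
        rewards = np.random.rand(BATCH_SIZE).astype(np.float32)
        return observations, rewards

    # The environment only supplies feed_dict data; it never creates or
    # modifies TensorFlow ops, e.g.:
    #   obs, rew = random_batch()
    #   sess.run(train_op, feed_dict={obs_ph: obs, reward_ph: rew})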

On 7/15/2016 I did a "git pull" to HEAD for TensorFlow. I executed the graph with and without the GPU enabled and recorded the times (see the chart below). The unexpected result was that the GPU outperformed the CPU, which is the initial expectation that hadn't been met before. So this code, "cpuvsgpu.py", with its supporting libraries performs better with the GPU, and I turned my attention to what may differ between my Original Solution and the published code. I also updated to HEAD as of 7/17/2016. Something did improve: the overall difference between the CPU and GPU on the Original Solution is much closer than a week ago, when I was seeing 47 s CPU vs 71 s GPU. A quick look at the new traces versus my initial trace suggests that summaries may have changed, but there may have been other improvements as well.
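The with/without-GPU comparison can be reproduced with a sketch like the following, which masks the GPU via ConfigProto's device_count (an assumption about method; the attached script may toggle devices differently):

    import time
    import tensorflow as tf

    def timed_run(use_gpu):
        # device_count={'GPU': 0} hides all GPUs from this session.
        config = (tf.ConfigProto() if use_gpu
                  else tf.ConfigProto(device_count={'GPU': 0}))
        tf.reset_default_graph()
        x = tf.random_normal([2000, 2000])
        y = tf.matmul(x, x)
        with tf.Session(config=config) as sess:
            start = time.time()
            for _ in range(50):
                sess.run(y)
            return time.time() - start

    print('CPU-only: %.2fs   with GPU: %.2fs'
          % (timed_run(False), timed_run(True)))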

[Chart: GTX 950 CPU vs GPU timing results]

I tried two other combinations to better reflect how the Original Solution functions. The first was heavy CPU load (~60%-70%), simulated with concurrent executions of the script. The other variation was to increase the "data IO": the Original Solution keeps lists of observations and randomly selects from them for training; the list has a fixed upper limit, after which each new append deletes the first item in the list (a sketch of that buffer follows below). I figured one of these might be slowing the streaming of data to the GPU. Unfortunately, neither variation caused the CPU to outperform the GPU. I also ran a quick GPUTESTER app, which does large matrix multiplications, to get a feel for how timing differs with task size; the results were as expected.
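For what it's worth, the fixed-size observation list behaves like the sketch below; a collections.deque with maxlen gives the same append-and-evict semantics (the Original Solution uses a plain Python list, so this is an equivalent, not its exact code):

    import random
    from collections import deque

    MAX_OBS = 10000  # hypothetical fixed upper limit
    observations = deque(maxlen=MAX_OBS)

    def store(obs):
        # Once full, appending silently evicts the oldest entry, matching
        # "delete the first item in the list while appending the new".
        observations.append(obs)

    def sample(batch_size):
        # Random selection of stored observations for training.
        return random.sample(observations, batch_size)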

I would really like to know how to improve this graph and reduce the number of small ops; it seems like this is where most of the performance is going. It would be nice to learn any tricks for combining smaller ops into bigger ones without impacting the logic (function) of the graph.
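As an illustration of the kind of trick I mean (this is a generic pattern, not taken from my actual graph, whose small ops may not fuse this cleanly), many tiny matmuls can be replaced with a single stacked one:

    import tensorflow as tf

    W = tf.random_normal([64, 64])
    slices = [tf.random_normal([1, 64]) for _ in range(128)]

    # Fine-grained: 128 separate 1x64 * 64x64 matmuls -> 128 tiny kernels.
    many_small = [tf.matmul(s, W) for s in slices]

    # Coarse-grained: stack once, multiply once -> one large kernel.
    # (TF 0.x argument order shown: tf.concat(axis, values); TF >= 1.0
    # reverses this to tf.concat(values, axis).)
    stacked = tf.concat(0, slices)
    one_big = tf.matmul(stacked, W)  # same arithmetic, far fewer ops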

asked Jul 31 '16 by mazecreator



1 Answer

Thanks for the excellent post.

I am experiencing a similar issue: GPU+CPU processing takes more CPU time and elapsed time than CPU-only processing for two of the examples provided by TensorFlow, the linear regression loss model and MNIST for Beginners, while the MNIST Deep script shows a significant improvement in both CPU and elapsed time when using the GPU. The discussion starts on page 10 of Profiling GPU and CPU Performance.

Here are the numbers:

workload     | win 8.1   win 8.1   win 8.1    win 10    win 10    win 10  
             | cpu only  cpu       gpu        cpu only  cpu       gpu      
-------------+-----------------------------------------------------------
mnist deep   | 14053     384.26   328.92      12406     289.28   211.79 
mnist deep   | 14044     384.59   328.45      12736     293.71   210.48
mnist10,000  | 24.10      45.85     7.67      26.56      44.42     7.32  
mnist10,000  | 23.94      44.98     7.56      25.80      44.24     7.32  
mnist50,000  | 95.49     198.12    38.26     109.99     197.82    36.15  
mnist50,000  | 96.07     197.86    37.91     109.46     195.39    39.44  
   lr10,000  |  6.23      15.08     1.78       7.38      16.79     1.91  
   lr10,000  |  6.33      15.23     1.78       7.44      16.59     1.91  
  lr100,000  | 48.31     124.37    17.67      62.14     148.81    19.04  
  lr100,000  | 48.97     123.35    17.63      61.40     147.69    18.72  

(Source: Profiling GPU and CPU Performance, Fig. 64 Results)
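For anyone reproducing figures like these, a harness along the following lines separates process CPU time from elapsed wall time; whether the cited report measured exactly this way is an assumption on my part:

    import os
    import time

    def measure(run_step, steps=1000):
        """Run `run_step` (e.g. a lambda around sess.run) `steps` times and
        return (process CPU seconds, elapsed wall seconds)."""
        wall0 = time.time()
        t0 = os.times()  # (user, system, ...) CPU times for this process
        for _ in range(steps):
            run_step()
        t1 = os.times()
        cpu = (t1[0] - t0[0]) + (t1[1] - t0[1])
        return cpu, time.time() - wall0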

answered Oct 17 '22 by djyredhat