 

TensorFlow: Graph Optimization (GPU vs CPU Performance)

This issue was originally posted on GitHub as #3320. It is worth starting there, as the thread has more detail on the original problem, and it is bulky enough that I don't want to re-post it all on StackOverflow. In summary: performance is slower when using the GPU than the CPU to process the TensorFlow graph. CPU/GPU timelines (from debugging) are included for evaluation. One of the comments was about optimizing the graph to speed up processing, with a request for a toy example to discuss. The "Original Solution" is my reinforcement learning code that showed the slow performance, and I created a few published scripts for community discussion and evaluation.

I have enclosed the test scripts as well as some of the raw data, trace files, and TensorBoard log files to speed up any review: CPUvsGPU testing.zip
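For reference, trace files like the attached ones can be captured with TensorFlow's timeline tooling; this is a minimal sketch of the standard mechanism, not necessarily the exact code used in the zip:

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Toy graph; the attached traces come from the real scripts instead.
    x = tf.random_normal([1000, 1000])
    y = tf.matmul(x, x)

    with tf.Session() as sess:
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        sess.run(y, options=run_options, run_metadata=run_metadata)

        # Convert the collected step stats into Chrome's trace format;
        # the resulting JSON can be inspected at chrome://tracing.
        trace = timeline.Timeline(step_stats=run_metadata.step_stats)
        with open('timeline.json', 'w') as f:
            f.write(trace.generate_chrome_trace_format())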

The discussion was moved to StackOverflow because this topic would benefit all TensorFlow users. What I am hoping to discover are ways to optimize the performance of the published graph. The GPU-vs-CPU issue can be separated out, as it might be solved by a more efficient TensorFlow graph.

What I did was take my Original Solution and strip out the "Game Environment", replacing it with random data generation (a rough sketch of the substitution is below). In this game environment there is no creation or modification of the TensorFlow graph. The structure closely follows/leverages nivwusquorum's GitHub Reinforcement Learning Example.
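As an illustration of that substitution (the names here are mine, not from cpuvsgpu.py), the stand-in environment just hands random NumPy arrays to the training loop and never touches the graph:

    import numpy as np

    # Hypothetical shapes; the published script defines its own.
    OBS_DIM = 32
    BATCH_SIZE = 64

    def random_batch():
        """Stand-in for the stripped-out Game Environment: random data
        instead of real game observations and rewards."""
        observations = np.random.randn(BATCH_SIZE, OBS_DIM).astype(np.float32)
        rewards = np.random.rand(BATCH_SIZE).astype(np.float32)
        return observations, rewards

    # The environment only supplies feed_dict data; it never creates or
    # modifies TensorFlow ops, e.g.:
    #   obs, rew = random_batch()
    #   sess.run(train_op, feed_dict={obs_ph: obs, reward_ph: rew})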

On 7/15/2016 I did a "git pull" to HEAD for TensorFlow. I executed the graph with and without the GPU enabled and recorded the times (see the chart below). The unexpected result was that the GPU outperformed the CPU, which is the initial expectation that hadn't been met before. So this code, "cpuvsgpu.py", with its supporting libraries performs better with the GPU, and I turned my attention to what may differ between my Original Solution and the published code. I also updated to HEAD as of 7/17/2016. Something did improve: the overall difference between the CPU and GPU on the Original Solution is much closer than a week ago, when I was seeing 47 s CPU vs 71 s GPU. A quick look at the new traces versus my initial trace suggests that summaries may have changed, but there may have been other improvements as well.
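The with/without-GPU comparison can be reproduced with a sketch like the following, which masks the GPU via ConfigProto's device_count (an assumption about method; the attached script may toggle devices differently):

    import time
    import tensorflow as tf

    def timed_run(use_gpu):
        # device_count={'GPU': 0} hides all GPUs from this session.
        config = (tf.ConfigProto() if use_gpu
                  else tf.ConfigProto(device_count={'GPU': 0}))
        tf.reset_default_graph()
        x = tf.random_normal([2000, 2000])
        y = tf.matmul(x, x)
        with tf.Session(config=config) as sess:
            start = time.time()
            for _ in range(50):
                sess.run(y)
            return time.time() - start

    print('CPU-only: %.2fs   with GPU: %.2fs'
          % (timed_run(False), timed_run(True)))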

[Chart: GTX 950 CPU vs GPU timing results]

I tried two other combinations to better reflect how the Original Solution functions. The first was heavy CPU load (~60%-70%), simulated with concurrent executions of the script. The other variation was to increase the "data IO": the Original Solution keeps lists of observations and randomly selects from them for training; the list has a fixed upper limit, after which each new append deletes the first item in the list (a sketch of that buffer follows below). I figured one of these might be slowing the streaming of data to the GPU. Unfortunately, neither variation caused the CPU to outperform the GPU. I also ran a quick GPUTESTER app, which does large matrix multiplications, to get a feel for how timing differs with task size; the results were as expected.
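For what it's worth, the fixed-size observation list behaves like the sketch below; a collections.deque with maxlen gives the same append-and-evict semantics (the Original Solution uses a plain Python list, so this is an equivalent, not its exact code):

    import random
    from collections import deque

    MAX_OBS = 10000  # hypothetical fixed upper limit
    observations = deque(maxlen=MAX_OBS)

    def store(obs):
        # Once full, appending silently evicts the oldest entry, matching
        # "delete the first item in the list while appending the new".
        observations.append(obs)

    def sample(batch_size):
        # Random selection of stored observations for training.
        return random.sample(observations, batch_size)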

I would really like to know how to improve this graph and reduce the number of small ops; it seems like this is where most of the performance is going. It would be nice to learn any tricks for combining smaller ops into bigger ones without impacting the logic (function) of the graph.
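As an illustration of the kind of trick I mean (this is a generic pattern, not taken from my actual graph, whose small ops may not fuse this cleanly), many tiny matmuls can be replaced with a single stacked one:

    import tensorflow as tf

    W = tf.random_normal([64, 64])
    slices = [tf.random_normal([1, 64]) for _ in range(128)]

    # Fine-grained: 128 separate 1x64 * 64x64 matmuls -> 128 tiny kernels.
    many_small = [tf.matmul(s, W) for s in slices]

    # Coarse-grained: stack once, multiply once -> one large kernel.
    # (TF 0.x argument order shown: tf.concat(axis, values); TF >= 1.0
    # reverses this to tf.concat(values, axis).)
    stacked = tf.concat(0, slices)
    one_big = tf.matmul(stacked, W)  # same arithmetic, far fewer ops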

asked Jul 31 '16 by mazecreator



1 Answer

Thanks for the excellent post.

I am experiencing a similar issue: GPU+CPU processing takes more CPU time and elapsed time than CPU-only processing for two of the examples provided by TensorFlow, the linear regression loss model and MNIST for Beginners, while the MNIST Deep script shows a significant improvement in both CPU and elapsed time when using the GPU. The discussion starts on page 10 of Profiling GPU and CPU Performance.

Here are the numbers:

workload     | win 8.1   win 8.1   win 8.1    win 10    win 10    win 10  
             | cpu only  cpu       gpu        cpu only  cpu       gpu      
-------------+-----------------------------------------------------------
mnist deep   | 14053     384.26   328.92      12406     289.28   211.79 
mnist deep   | 14044     384.59   328.45      12736     293.71   210.48
mnist10,000  | 24.10      45.85     7.67      26.56      44.42     7.32  
mnist10,000  | 23.94      44.98     7.56      25.80      44.24     7.32  
mnist50,000  | 95.49     198.12    38.26     109.99     197.82    36.15  
mnist50,000  | 96.07     197.86    37.91     109.46     195.39    39.44  
   lr10,000  |  6.23      15.08     1.78       7.38      16.79     1.91  
   lr10,000  |  6.33      15.23     1.78       7.44      16.59     1.91  
  lr100,000  | 48.31     124.37    17.67      62.14     148.81    19.04  
  lr100,000  | 48.97     123.35    17.63      61.40     147.69    18.72  

(Source: Profiling GPU and CPU Performance, Fig. 64 Results)
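For anyone reproducing figures like these, a harness along the following lines separates process CPU time from elapsed wall time; whether the cited report measured exactly this way is an assumption on my part:

    import os
    import time

    def measure(run_step, steps=1000):
        """Run `run_step` (e.g. a lambda around sess.run) `steps` times and
        return (process CPU seconds, elapsed wall seconds)."""
        wall0 = time.time()
        t0 = os.times()  # (user, system, ...) CPU times for this process
        for _ in range(steps):
            run_step()
        t1 = os.times()
        cpu = (t1[0] - t0[0]) + (t1[1] - t0[1])
        return cpu, time.time() - wall0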

answered Oct 17 '22 by djyredhat