C++ AMP with fast GPUs slower than CPU

I'm just starting to learn C++ AMP and I've built a few examples with the VS 2012 RC, but I'm finding that the GPU is slower than the CPU. For instance, the examples by Kate Gregory: http://ampbook.codeplex.com/releases/view/90595 (relevant to her upcoming book http://www.gregcons.com/cppamp/). In a lecture I watched, she obtained a ~5x performance improvement for the chapter 4 example by using her laptop's GPU (I believe she said it was a 6650) compared to the CPU (not sure which CPU she had).

I've tried the example myself on a couple of system configurations (listed below) and I've always found the CPU to be faster. I've also tested other examples and found the same. Am I doing something wrong? Is there a reason for the slower than expected performance? Does anyone have an example that would definitely show the GPU being faster?

  • System 1: Intel i7 2600K with onboard graphics (I expect this to be slower)
  • System 2: Intel i7 2630QM with Intel HD switchable with AMD 6770 (I have it running in performance mode so it should be using the 6770)
  • System 3: Intel i5 750 with 2xCrossfire AMD HD 5850

Example of results: chapter4 project results in 1.15ms CPU, 2.57ms GPU, 2.55ms GPU tiled.

Edit:

Doh, I think I just found the reason why - the values for the size of the matrices she used in the lecture were different. The sample on the website uses M=N=W=64. If I use 64, 512 and 256 as she did in the lecture then I get the corresponding ~5x increase in performance.

asked Aug 06 '12 by CarbonTwelve
1 Answer

It seems like your overarching question is WHY moving things to the GPU doesn't always get you a benefit. The answer is copy time. Imagine a calculation that takes a time proportional to n squared. Copying takes a time proportional to n. You might need quite a large n before the time spent copying to and from the GPU is outweighed by the time saved doing the calculation there.

The book mentions this briefly in the early chapters, and Chapters 7 and 8 are all about performance and optimization. Chapter 7 is on Rough Cuts now; Chapter 8 should be there shortly. (Its code is already on Codeplex - the Reduction case study.)

I've just checked in an update to the Chapter 4 code that uses the Tech Ed starting numbers instead of the ones that were there before. Smaller matrices lose too much time to the copy to/from the GPU; larger ones take too long to be a good demo. But do feel free to play around with the sizes. Make them even larger if you don't mind a minute or two of "dead air", and see what happens.

answered Sep 29 '22 by Kate Gregory