Simplest Possible Example to Show GPU Outperform CPU Using CUDA

Tags: g++, cuda, nvcc

I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.

To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.

asked Oct 05 '11 14:10

Chris Redford

2 Answers

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of a GPU, you'll need a big problem size to amortize the startup costs over; otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
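To put the amortization argument in numbers, here is a minimal sketch of a toy cost model. The model and all of its parameters (`overhead`, `cpu_per_item`, `gpu_per_item`) are assumptions for illustration, not measurements: the CPU is taken to cost a fixed time per item, and the GPU a fixed startup overhead plus a smaller time per item.

```cpp
#include <cmath>

// Hypothetical cost model (all parameters are assumptions, not measurements):
//   cpu_time(n) = n * cpu_per_item
//   gpu_time(n) = overhead + n * gpu_per_item
// Returns the smallest item count n at which the GPU is strictly faster,
// or -1 if the GPU never wins under this model.
long long break_even_items(double overhead, double cpu_per_item, double gpu_per_item)
{
    if (cpu_per_item <= gpu_per_item) return -1;
    double n = overhead / (cpu_per_item - gpu_per_item);
    return static_cast<long long>(std::floor(n)) + 1;
}
```

Under this model, any problem smaller than the returned size is the two-foot race: the fixed overhead dominates and the CPU wins.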

Use something like this in C++:

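  #include <cstdio>

  #define N (1024*1024)
  #define M (1000000)

  // Static storage: a 4 MB array can overflow the default stack on some platforms.
  static float data[N];

  int main()
  {
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     // Reading an index at run time keeps the compiler from discarding the loop.
     int sel;
     printf("Enter an index: ");
     if(scanf("%d", &sel) == 1 && sel >= 0 && sel < N)
        printf("data[%d] = %f\n", sel, data[sel]);
  }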

Use something like this in CUDA/C:

  #include <cstdio>

  #define N (1024*1024)
  #define M (1000000)

  // One thread per element. There is no bounds check, so N must be
  // a multiple of the block size used in the launch below.
  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  // Static storage: a 4 MB array can overflow the default stack on some platforms.
  static float data[N];

  int main()
  {
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     cudakernel<<<N/256, 256>>>(d_data);
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data);

     int sel;
     printf("Enter an index: ");
     if(scanf("%d", &sel) == 1 && sel >= 0 && sel < N)
        printf("data[%d] = %f\n", sel, data[sel]);
  }

If the GPU doesn't come out ahead, try making N and M bigger, or changing 256 to 128 or 512. (Keep the block size a divisor of N, since the kernel has no bounds check.)
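Neither program above reports how long it took, so here is a minimal sketch of timing the CPU side with `std::chrono`. The helper names `iterate` and `time_cpu` are made up for this example, and taking `n` and `m` as parameters instead of the `N` and `M` macros is an assumption so that small sizes can be tried first. With the N and M above, the loop performs roughly 10^12 inner iterations, so expect a long wait on a single core.

```cpp
#include <chrono>
#include <vector>

// Same recurrence as the inner loop above: x <- x*x - 0.25, repeated m times.
float iterate(float x, int m)
{
    for (int j = 0; j < m; j++)
        x = x * x - 0.25f;
    return x;
}

// Runs the CPU version for n elements with m iterations each,
// stores the results in `out`, and returns elapsed wall-clock seconds.
double time_cpu(int n, int m, std::vector<float>& out)
{
    out.assign(n, 0.0f);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++)
        out[i] = iterate(1.0f * i / n, m);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For the GPU side, the same clock can be placed around the kernel launch plus the `cudaMemcpy`: the launch itself returns immediately, but the copy waits for the kernel to finish, so the pair together covers the real run time.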

answered Sep 18 '22 23:09

Patrick87

A very, very simple method would be to calculate the squares of, say, the first 100,000 integers, or to do a large matrix operation. It's easy to implement and plays to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
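For reference, the CPU side of the squares idea might look like the sketch below. This is not the code from that experiment: the function name `squares`, the choice to start at 0, and the 64-bit result type are all assumptions made for this example. Each element is independent of the others, which is what makes a one-thread-per-element GPU version the natural counterpart.

```cpp
#include <cstdint>
#include <vector>

// Squares of 0, 1, ..., n-1. Results are 64-bit because 99,999^2
// (about 10^10) does not fit in a 32-bit integer.
std::vector<std::uint64_t> squares(std::uint32_t n)
{
    std::vector<std::uint64_t> out(n);
    for (std::uint32_t i = 0; i < n; i++)
        out[i] = static_cast<std::uint64_t>(i) * i;
    return out;
}
```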

Are you looking for example code, or just ideas?

Edit

The 40x was vs a dual core CPU, not a quad core.

Some pointers:

Make sure you're not running, say, Crysis while running your benchmarks.
Shut down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two-year-old start Despicable Me by inserting the disc. Yay.)

As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.

(These are probably pretty obvious in retrospect.)
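Even on a quiet machine, a single run can be skewed by whatever else happens to wake up. One common guard, sketched below, is to repeat the measurement and keep the fastest run. The helper name `best_of` is made up for this example, and using the minimum (rather than, say, the median) is a design choice assumed here, not something from the answer.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Runs `work` `repeats` times and returns the fastest wall-clock time in seconds.
// Returns +infinity if repeats is zero or negative.
template <typename F>
double best_of(int repeats, F work)
{
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; r++) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(stop - start).count());
    }
    return best;
}
```

Something like `best_of(5, [&]{ run_benchmark(); })` would then report the least-disturbed of five runs, where `run_benchmark` stands in for whichever CPU or GPU routine is being measured.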

answered Sep 19 '22 23:09

3Dave
