
Simplest Possible Example to Show GPU Outperform CPU Using CUDA

Tags: g++, cuda, nvcc

I am looking for the smallest amount of code that can be written for both a CPU (using g++) and a GPU (using nvcc) and for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.

To clarify: I'm literally looking for two short blocks of code, one for the CPU (C++ compiled with g++) and one for the GPU (C++ compiled with nvcc), for which the GPU is faster. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.

Chris Redford asked Oct 05 '11

People also ask

Is CUDA better than CPU?

CUDA cores are used for a lot of things, but the main thing they're used for is to enable efficient parallel computing. A single CUDA core is similar to a CPU core, the primary difference being that it is less capable but implemented in much greater numbers, which in turn allows for massively parallel computing.

How does CUDA calculate execution time?

You can use the Compute Visual Profiler, which will be great for your purpose. It measures the time of every CUDA function and tells you how many times you called it.
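A common way to time a kernel directly from the host is with CUDA events. Below is a minimal sketch; the busywork kernel and its sizes are made up just so the example is self-contained, so substitute whatever you actually want to measure.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in kernel so the example compiles on its own; time your own kernel instead.
    __global__ void busywork(float *buf)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        buf[i] = 1.0f * i;
        for (int j = 0; j < 1000; j++)
            buf[i] = buf[i] * buf[i] - 0.25f;
    }

    int main()
    {
        float *d_data;
        cudaMalloc(&d_data, 1024 * 1024 * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        busywork<<<4096, 256>>>(d_data);     // 4096 blocks * 256 threads = 1M elements
        cudaEventRecord(stop);

        cudaEventSynchronize(stop);          // wait for the kernel to finish
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
    }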

What are the three general section of CUDA program?

To execute any CUDA program, there are three main steps: Copy the input data from host memory to device memory, also known as host-to-device transfer. Load the GPU program and execute, caching data on-chip for performance. Copy the results from device memory to host memory, also called device-to-host transfer.
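As an illustration of those three steps, here is a minimal sketch; the square kernel and the array size are invented for the example.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define N 1024

    // Made-up kernel for the example: squares each element in place.
    __global__ void square(float *buf)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        buf[i] = buf[i] * buf[i];
    }

    int main()
    {
        static float host[N];
        float *dev;
        for (int i = 0; i < N; i++) host[i] = (float)i;

        cudaMalloc(&dev, N * sizeof(float));

        // 1. Copy the input data from host memory to device memory (host-to-device).
        cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Load the GPU program and execute it.
        square<<<N / 256, 256>>>(dev);

        // 3. Copy the results from device memory back to host memory (device-to-host).
        cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(dev);
        printf("host[3] = %f\n", host[3]);   // expect 9.0
    }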

Is CUDA C or C++?

CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.


2 Answers

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of the GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.

Use something like this in C++:

    #include <cstdio>

    #define N (1024*1024)
    #define M (1000000)

    int main()
    {
        // static: a 4 MB array would likely overflow the stack
        static float data[N];

        for (int i = 0; i < N; i++)
        {
            data[i] = 1.0f * i / N;
            for (int j = 0; j < M; j++)
            {
                data[i] = data[i] * data[i] - 0.25f;
            }
        }

        int sel;
        printf("Enter an index: ");
        scanf("%d", &sel);
        printf("data[%d] = %f\n", sel, data[sel]);

        return 0;
    }

Use something like this in CUDA/C:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define N (1024*1024)
    #define M (1000000)

    // Each thread initializes one element, then iterates the same map as the CPU loop.
    __global__ void cudakernel(float *buf)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        buf[i] = 1.0f * i / N;
        for (int j = 0; j < M; j++)
            buf[i] = buf[i] * buf[i] - 0.25f;
    }

    int main()
    {
        // static: a 4 MB array would likely overflow the stack
        static float data[N];
        float *d_data;

        cudaMalloc(&d_data, N * sizeof(float));
        cudakernel<<<N/256, 256>>>(d_data);   // 4096 blocks of 256 threads
        cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);

        int sel;
        printf("Enter an index: ");
        scanf("%d", &sel);
        printf("data[%d] = %f\n", sel, data[sel]);

        return 0;
    }

If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
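To actually compare the two, build each version with its own toolchain and time it end to end. Something along these lines should work; the file names are placeholders, and piping an index in avoids sitting at the scanf prompt:

    g++ -O2 cpu_version.cpp -o cpu_version
    nvcc -O2 gpu_version.cu -o gpu_version

    time sh -c 'echo 0 | ./cpu_version'
    time sh -c 'echo 0 | ./gpu_version'

Keep in mind the CPU version does roughly N * M ≈ 10^12 inner-loop iterations with these defaults, so it can take a very long time; you may want to shrink M first just to check that both versions produce the same output.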

Patrick87 answered Sep 18 '22


A very, very simple method would be to calculate the squares of, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and lends itself to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX 460 achieved about 40x the performance of a dual-core CPU.)

Are you looking for example code, or just ideas?
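For reference, a minimal CUDA sketch of that squaring idea (the original comparison was done in OpenCL, but the shape is the same) could look like this; the kernel name and launch sizes are made up for the example:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define COUNT 100000   // first 100,000 integers

    // Each thread squares one integer.
    __global__ void squares(long long *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < COUNT)
            out[i] = (long long)i * i;
    }

    int main()
    {
        static long long host[COUNT];
        long long *dev;

        cudaMalloc(&dev, COUNT * sizeof(long long));
        squares<<<(COUNT + 255) / 256, 256>>>(dev);
        cudaMemcpy(host, dev, COUNT * sizeof(long long), cudaMemcpyDeviceToHost);
        cudaFree(dev);

        printf("%d^2 = %lld\n", COUNT - 1, host[COUNT - 1]);
    }

On its own this is far too little work to amortize the launch and transfer costs, so for a meaningful benchmark you would scale COUNT up or give each element more work, as the first answer points out.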

Edit

The 40x was vs a dual core CPU, not a quad core.

Some pointers:

  • Make sure you're not running, say, Crysis while your benchmarks are running.
  • Shut down all unnecessary apps and services that might be stealing CPU time.
  • Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two-year-old start Despicable Me by inserting the disc. Yay.)

As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.

(These are probably pretty obvious in retrospect.)

3Dave answered Sep 19 '22