I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
CUDA Cores are used for a lot of things, but the main thing they're used for is to enable efficient parallel computing. A single CUDA core is similar to a CPU core, with the primary difference being that it is less capable but implemented in much greater numbers. Which again allows for great parallel computing.
You can use the compute visula profiler which will be great for your purpose. it measures the time of every cuda function and tells you how many times you called it .
To execute any CUDA program, there are three main steps: Copy the input data from host memory to device memory, also known as host-to-device transfer. Load the GPU program and execute, caching data on-chip for performance. Copy the results from device memory to host memory, also called device-to-host transfer.
CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it take some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#define N (1024*1024) #define M (1000000) int main() { float data[N]; int count = 0; for(int i = 0; i < N; i++) { data[i] = 1.0f * i / N; for(int j = 0; j < M; j++) { data[i] = data[i] * data[i] - 0.25f; } } int sel; printf("Enter an index: "); scanf("%d", &sel); printf("data[%d] = %f\n", sel, data[sel]); }
Use something like this in CUDA/C:
#define N (1024*1024) #define M (1000000) __global__ void cudakernel(float *buf) { int i = threadIdx.x + blockIdx.x * blockDim.x; buf[i] = 1.0f * i / N; for(int j = 0; j < M; j++) buf[i] = buf[i] * buf[i] - 0.25f; } int main() { float data[N]; int count = 0; float *d_data; cudaMalloc(&d_data, N * sizeof(float)); cudakernel<<<N/256, 256>>>(d_data); cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost); cudaFree(d_data); int sel; printf("Enter an index: "); scanf("%d", &sel); printf("data[%d] = %f\n", sel, data[sel]); }
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. Ita easy to implement and lends itself to the the GPUs strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ awhile back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With