I wanted to compare the speed of a single Intel CPU core with the speed of a single NVIDIA GPU core (i.e., a single CUDA core, a single thread). I implemented the following naive 2D image convolution algorithm:
#include <stdint.h>

void convolution_cpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height,
                     uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum;
    int32_t fkx, fky;
    int32_t xx, yy;

    // Normalization factor: 1 / (sum of all kernel coefficients).
    float krl_sum = 0;
    for (uint32_t i = 0; i < krl_width * krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f / krl_sum;

    for (int32_t y = 0; y < (int32_t)img_height; ++y)
    {
        for (int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            sum = 0;
            for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
            {
                fky = krl_height - 1 - ky;   // flipped kernel row index
                for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    fkx = krl_width - 1 - kx;   // flipped kernel column index
                    yy = y + (ky - center_y);
                    xx = x + (kx - center_x);
                    // Skip taps that fall outside the image (zero padding at the borders).
                    if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                    {
                        sum += img[yy * img_width + xx] * krl[fky * krl_width + fkx];
                    }
                }
            }
            res[y * img_width + x] = sum * nc;
        }
    }
}
The algorithm is the same for both CPU and GPU. I also made another GPU version that is almost identical to the one above; the only difference is that it transfers the img and krl arrays into shared memory before using them.
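For reference, a minimal sketch of a single-thread launch of the same algorithm looks something like this (an illustration rather than my exact code; identifiers such as convolution_gpu, d_res, d_img and d_krl are just placeholder names):

#include <stdint.h>

// Sketch: the same naive algorithm, executed by a single GPU thread.
__global__ void convolution_gpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height,
                                uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;

    float krl_sum = 0.0f;
    for (uint32_t i = 0; i < krl_width * krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f / krl_sum;

    // Exactly the same loop nest as convolution_cpu, run by one thread.
    for (int32_t y = 0; y < (int32_t)img_height; ++y)
    {
        for (int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            int32_t sum = 0;
            for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
            {
                int32_t fky = krl_height - 1 - ky;
                for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    int32_t fkx = krl_width - 1 - kx;
                    int32_t yy = y + (ky - center_y);
                    int32_t xx = x + (kx - center_x);
                    if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                        sum += img[yy * img_width + xx] * krl[fky * krl_width + fkx];
                }
            }
            res[y * img_width + x] = sum * nc;
        }
    }
}

// Launched with one block containing one thread, so a single CUDA core does all the work:
// convolution_gpu<<<1, 1>>>(d_res, d_img, img_width, img_height, d_krl, krl_width, krl_height);

The <<<1, 1>>> launch configuration means one block with one thread, so a single CUDA core executes the whole loop nest.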
I used 2 images of dimensions 52x52 each and I got the following performance:
The CPU is an Intel Xeon X5650 @ 2.67 GHz and the GPU is an NVIDIA Tesla C2070.
Why do I get such a performance difference? It looks like a single CUDA core is about 100 times slower for this particular code! Could someone explain why? I can think of a few reasons myself, but what do you think is the major issue causing this huge performance difference?
Keep in mind that I want to compare the speed of a single CPU thread with that of a single GPU thread. I am not trying to evaluate the GPU's overall computing performance, and I am aware that this is not the right way to do convolution on a GPU.
Let me try to explain; maybe it will help you.
The CPU acts as the host and the GPU acts as the device.
To run a thread on the GPU, the CPU has to copy everything to the GPU first: the code to execute and the data it will operate on. For a small problem like this, that copying time is usually greater than the computation time, because the computation itself is only a handful of ALU (arithmetic and logic unit) instructions, while copying across the PCIe bus takes much longer.
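For example, you can measure the copy time and the computation time separately with CUDA events. This is a rough sketch, not your code: it assumes the device buffers d_res, d_img and d_krl have already been allocated with cudaMalloc, that a convolution_gpu kernel exists, and it omits error checking.

#include <stdio.h>
#include <stdint.h>
#include <cuda_runtime.h>

// Host-side timing sketch: copy time vs. compute time (error checking omitted).
void time_copy_and_compute(uint8_t* d_res, uint8_t* d_img, uint8_t* d_krl,
                           const uint8_t* img, const uint8_t* krl,
                           uint32_t img_width, uint32_t img_height,
                           uint32_t krl_width, uint32_t krl_height)
{
    cudaEvent_t start, stop;
    float copy_ms = 0.0f, compute_ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the host-to-device copies of the image and the filter kernel.
    cudaEventRecord(start);
    cudaMemcpy(d_img, img, img_width * img_height, cudaMemcpyHostToDevice);
    cudaMemcpy(d_krl, krl, krl_width * krl_height, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&copy_ms, start, stop);

    // Time the computation itself (single-thread launch, as in the question).
    cudaEventRecord(start);
    convolution_gpu<<<1, 1>>>(d_res, d_img, img_width, img_height, d_krl, krl_width, krl_height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&compute_ms, start, stop);

    printf("copy: %.3f ms, compute: %.3f ms\n", copy_ms, compute_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

This separates the transfer overhead from the time the single GPU thread actually spends computing.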
So when you run only one thread on the CPU, the CPU has all the data in its own memory, and that thread benefits from the full machinery of a modern CPU core: caches, branch prediction, prefetching, micro-op reordering, roughly 10x faster L1 and L2 caches, the ability to dispatch about 6x more instructions per cycle, and around 4.6x higher core clock frequency.
But when you want to run the thread on the GPU, the data has to be copied into GPU memory first, which takes additional time. Secondly, GPU cores run threads in groups (warps) each clock cycle, but for that to pay off you have to partition the data so that each thread accesses its own item of the array; in your example that means the img and krl arrays (see the kernel sketch below).
There is also a profiler available for NVIDIA GPUs. Remove any printouts (printf and the like) from your code if they exist and profile your executable; it will show you both the copy time and the computation time in milliseconds.
Loop parallelization: when you run the two loops over img_width and img_height on the CPU, the computation takes many clock cycles because at the instruction level the code steps through loop counters one element at a time. But when you port those loops to the GPU, you replace them with threadIdx.x and threadIdx.y, and a group of 16 or 32 threads runs per clock cycle on one multiprocessor of the GPU. That means 16 or 32 array items are processed per clock cycle, because the GPU has many more ALUs (provided there are no dependencies and the data is partitioned well).
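As a sketch of what that looks like (an illustration, not your actual code), each thread computes one output pixel, so the two outer loops disappear and only the loops over the filter kernel remain. Here the normalization factor nc is assumed to be precomputed on the host.

#include <stdint.h>

// One thread per output pixel: x and y come from the block and thread indices.
__global__ void convolution_gpu_parallel(uint8_t* res, uint8_t* img,
                                         uint32_t img_width, uint32_t img_height,
                                         uint8_t* krl, uint32_t krl_width, uint32_t krl_height,
                                         float nc)
{
    int32_t x = blockIdx.x * blockDim.x + threadIdx.x;
    int32_t y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (int32_t)img_width || y >= (int32_t)img_height)
        return;

    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum = 0;

    // Only the loops over the filter kernel remain; the x/y loops became threads.
    for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
    {
        int32_t fky = krl_height - 1 - ky;
        for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
        {
            int32_t fkx = krl_width - 1 - kx;
            int32_t yy = y + (ky - center_y);
            int32_t xx = x + (kx - center_x);
            if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                sum += img[yy * img_width + xx] * krl[fky * krl_width + fkx];
        }
    }
    res[y * img_width + x] = sum * nc;
}

// Launch: one thread per pixel, e.g. 16x16 threads per block covering the 52x52 image.
// dim3 block(16, 16);
// dim3 grid((img_width + block.x - 1) / block.x, (img_height + block.y - 1) / block.y);
// convolution_gpu_parallel<<<grid, block>>>(d_res, d_img, img_width, img_height,
//                                           d_krl, krl_width, krl_height, nc);

With this partitioning, all 52x52 = 2704 output pixels are computed by 2704 threads spread across the GPU's cores instead of by one thread looping over them.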
In your convolution algorithm you have kept the loops inside the kernel, so running the same code on the GPU brings no benefit: one GPU thread simply acts like one (much weaker) CPU thread, and on top of that you pay the overhead of memory copying, data partitioning, cache behaviour, and so on.
I hope this helps you understand.