I have installed the CUDA runtime and drivers, version 7.0, on my workstation (Ubuntu 14.04, 2x Intel Xeon E5 + 4x Tesla K20m). I've used the following program to check whether my installation works:
#include <stdio.h>

__global__ void helloFromGPU()
{
    printf("Hello World from GPU!\n");
}

int main(int argc, char **argv)
{
    printf("Hello World from CPU!\n");
    helloFromGPU<<<1, 1>>>();
    printf("Hello World from CPU! Again!\n");
    cudaDeviceSynchronize();
    printf("Hello World from CPU! Yet again!\n");
    return 0;
}
I get the correct output, but it takes an enormous amount of time:
$ nvcc hello.cu -O2
$ time ./hello > /dev/null
real 0m8.897s
user 0m0.004s
sys 0m1.017s
If I remove all device code, the overall execution takes 0.001s. So why does my simple program take almost 10 seconds?
The apparent slow runtime of your example is due to the fixed overhead of setting up the GPU context.
Because you are running on a platform that supports unified addressing, the CUDA runtime has to map 64GB of host RAM and 4 x 5120MB from your GPUs into a single virtual address space and register that with the Linux kernel.
A lot of kernel API calls are required to do that, and it isn't fast. That is most likely the main source of the slow performance you are observing. You should view it as a fixed start-up cost which must be amortised over the life of your application. In real-world applications, a 10-second start-up is trivial and of no real importance; in a hello-world example, it isn't.
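One way to convince yourself of this is to force context creation up front and time it separately from the rest of the program. A minimal sketch using the common cudaFree(0) idiom (which triggers lazy context initialisation without doing any other work):

#include <stdio.h>
#include <cuda_runtime.h>
#include <time.h>

int main()
{
    struct timespec t0, t1;

    // cudaFree(0) is a widely used idiom to force the CUDA context
    // to be created now, so its one-time cost can be measured alone.
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaFree(0);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Context creation took %.3f s\n", secs);

    // Subsequent CUDA calls and kernel launches should now run
    // without paying the start-up penalty again.
    return 0;
}

On a machine like yours, this call should account for nearly all of the 8-9 seconds you measured. As a further check, restricting the runtime to a single GPU with CUDA_VISIBLE_DEVICES=0 reduces the amount of memory that has to be mapped at context creation, and should shrink the start-up time accordingly.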