#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloCUDA(float f)
{
    // Runs once per thread; threadIdx.x identifies the thread within the block
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

int main()
{
    helloCUDA<<<1, 5>>>(1.2345f);   // launch 1 block of 5 threads
    cudaDeviceSynchronize();        // wait for the kernel to finish
    return 0;
}
Why is cudaDeviceSynchronize() used in so many places? For example here: is it actually required after the kernel call?
cudaDeviceSynchronize() forces the program to wait until all kernels and memcpys in the stream(s) are complete before continuing. This makes it easier to find out where illegal accesses are occurring, since the failure will show up during the sync.
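To illustrate, here is a minimal sketch of how the sync surfaces an error from an earlier, asynchronous launch. The kernel name badKernel and the deliberately invalid pointer are made up for this example:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void badKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;  // illegal write if out was never allocated
}

int main()
{
    int *d_out = nullptr;            // deliberately never allocated
    badKernel<<<1, 5>>>(d_out);

    // The launch itself typically reports cudaSuccess: the kernel has only
    // been queued, not executed yet.
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // The sync waits for the kernel to finish, so the illegal access shows
    // up here rather than at some later, unrelated API call.
    printf("sync:   %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    return 0;
}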
Figure 1 shows that a CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.

Figure 1. The kernel is a function executed on the GPU.
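As a concrete sketch of the "K times in parallel" idea (the kernel name addOne and the size K are illustrative, not from the figure): each of the K threads runs the kernel body once, on its own element, instead of a host loop iterating K times.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *data)
{
    int i = threadIdx.x;   // each thread works on a distinct element
    data[i] += 1.0f;
}

int main()
{
    const int K = 8;
    float h[K] = {0};
    float *d;
    cudaMalloc(&d, K * sizeof(float));
    cudaMemcpy(d, h, K * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<1, K>>>(d);   // the body runs K times in parallel, once per thread

    // Device-to-host copy implicitly waits for the kernel to finish
    cudaMemcpy(h, d, K * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0]=%f h[%d]=%f\n", h[0], K - 1, h[K - 1]);
    cudaFree(d);
    return 0;
}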
In CUDA, kernel launches are asynchronous (often called “non-blocking”). From the host's perspective, kernel execution proceeds as follows: 1. The host call starts the kernel execution. 2. The call returns immediately; the host thread continues without waiting. 3. The kernel runs to completion on the GPU at some later point.
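A minimal sketch of these steps in action, using cudaStreamQuery() on the default stream to observe that the kernel is still running after the launch call has returned. The spin kernel and the cycle count are illustrative, and exact timing is not guaranteed:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }   // keep the GPU busy for a while
}

int main()
{
    spin<<<1, 1>>>(100000000LL);

    // Control is already back on the host; the kernel is still running,
    // so querying the default stream typically reports cudaErrorNotReady.
    printf("right after launch: %s\n", cudaGetErrorString(cudaStreamQuery(0)));

    cudaDeviceSynchronize();                 // now actually wait for it
    printf("after sync:         %s\n", cudaGetErrorString(cudaStreamQuery(0)));
    return 0;
}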
Under dynamic parallelism, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new “nesting level,” and the total number of levels is the “nesting depth” of the program.
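For illustration, a minimal sketch of a nesting depth of two (the kernel names are made up; dynamic parallelism requires a device of compute capability 3.5 or higher and compilation with relocatable device code, e.g. nvcc -rdc=true):

#include <cstdio>

__global__ void child()
{
    printf("child thread %d\n", threadIdx.x);   // nesting level 2
}

__global__ void parent()
{
    printf("parent thread %d\n", threadIdx.x);  // nesting level 1
    child<<<1, 2>>>();                          // device-side launch
    // The parent grid is not considered complete until the child grid
    // finishes, so the host-side sync below covers both levels.
}

int main()
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}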
A kernel launch is asynchronous. This means it returns control to the CPU thread immediately after starting up the GPU process, before the kernel has finished executing.
So what is the next thing in the CPU thread here? Application exit.
At application exit, the process's ability to send output to standard output is terminated by the OS.
Thus the output that is generated later by the kernel has nowhere to go, and you won't see it.
On the other hand, if you use cudaDeviceSynchronize(), then the kernel is guaranteed to finish (and its output will find a waiting standard output queue) before the application is allowed to exit.
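For contrast, here is the same program with the sync removed. On most setups the kernel's output is never seen, because the process exits before the device-side printf buffer can be flushed; this is exactly the failure mode described above, though behavior at exit is not guaranteed:

#include <cstdio>

__global__ void helloCUDA(float f)
{
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

int main()
{
    helloCUDA<<<1, 5>>>(1.2345f);
    // No cudaDeviceSynchronize(): main returns immediately, the process
    // begins to exit, and the "Hello thread ..." lines are typically lost.
    return 0;
}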