I'm confused by some comments I've seen about blocking and cudaMemcpy. My understanding is that Fermi hardware can execute kernels and perform a cudaMemcpy simultaneously.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function blocks further CPU execution until the copy has fully completed? Or does it mean the copy won't start until the preceding kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy() blocks the CPU until the copy is complete, and the copy only begins once all preceding CUDA calls have completed. cudaMemcpyAsync() is asynchronous and does not block the CPU.
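To illustrate the asynchronous variant, here is a minimal sketch reusing the kernel name from the question (SomeCudaCall, h_data, and d_data are placeholders). Note that cudaMemcpyAsync only truly overlaps with CPU work when the host buffer is pinned, which is why cudaMallocHost is used here:

```cuda
int *h_data, *d_data;
cudaMallocHost(&h_data, sizeof(int));            // pinned host memory (required for real async copies)
cudaMalloc(&d_data, sizeof(int));

cudaStream_t stream;
cudaStreamCreate(&stream);

SomeCudaCall<<<25, 34, 0, stream>>>(d_data);     // kernel queued in the stream
cudaMemcpyAsync(h_data, d_data, sizeof(int),
                cudaMemcpyDeviceToHost, stream); // queued after the kernel, returns immediately
// The CPU is free to do other work here while the GPU runs.
cudaStreamSynchronize(stream);                   // wait for the kernel and the copy to finish

cudaFree(d_data);
cudaFreeHost(h_data);
cudaStreamDestroy(stream);
```

Within a single stream the copy still waits for the kernel; the difference from plain cudaMemcpy is that the CPU is not blocked while it waits.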
Most CUDA calls are synchronous (often called “blocking”). An example of a blocking call is cudaMemcpy().
malloc() allocates dynamic memory on the host, i.e. in CPU memory. To allocate global memory on the device you need to call cudaMalloc(). To operate on data with the GPU, your whole data set needs to be transferred to device global memory first. cudaMalloc() only allocates the memory; it does not copy your data to the device.
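A minimal sketch of that two-step pattern (allocate, then copy explicitly); the variable names are illustrative:

```cuda
int h_data[4] = {1, 2, 3, 4};                    // host (CPU) array
int *d_data;
cudaMalloc(&d_data, sizeof(h_data));             // allocates device memory only, contents undefined
cudaMemcpy(d_data, h_data, sizeof(h_data),
           cudaMemcpyHostToDevice);              // the copy has to be requested explicitly
// ... launch kernels that read d_data ...
cudaFree(d_data);
```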
The CUDA runtime makes it possible to compile and link your CUDA kernels into executables. This means that you don't have to distribute cubin files with your application, or deal with loading them through the driver API. As you have noted, it is generally easier to use.
Your examples are equivalent. If you want asynchronous execution, you can use streams (or separate contexts) together with cudaMemcpyAsync, so that kernel execution overlaps with the copy.
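A sketch of that overlap using two streams, reusing the kernel name from the question (d_a, h_b, d_b, and nbytes are assumed to be set up elsewhere, with h_b pinned):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

SomeCudaCall<<<25, 34, 0, s0>>>(d_a);            // compute in stream s0
cudaMemcpyAsync(d_b, h_b, nbytes,
                cudaMemcpyHostToDevice, s1);     // copy in stream s1, overlaps with the kernel

cudaDeviceSynchronize();                         // wait for both streams to drain

cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
```

Operations in different streams have no ordering relative to each other, which is what lets the hardware run the copy engine and the compute engine at the same time.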