I have:
Host memory that has been pinned and mapped using cudaHostAlloc(..., cudaHostAllocMapped) or cudaHostRegister(..., cudaHostRegisterMapped);
Device pointers to that memory obtained using cudaHostGetDevicePointer(...).
I initiate cudaMemcpy(..., cudaMemcpyDeviceToDevice) on src and dest device pointers that point to two different regions of pinned+mapped memory obtained by the technique above.
Everything works fine.
Question: should I continue doing this or just use a traditional CPU-style memcpy() since everything is in system memory anyway? ...or are they the same (i.e. does cudaMemcpy map to a straight memcpy when both src and dest are pinned)?
(I am still using the cudaMemcpy method because previously everything was in device global memory, but have since switched to pinned memory due to gmem size constraints)
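For reference, the setup described in this question can be sketched roughly as below. This is only a minimal illustration, not the asker's actual code: the 1 MiB buffer size, the fill pattern, and the CHECK macro are assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): abort with a message on any runtime error.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;  // assumed buffer size (1 MiB)

    // Request support for mapped pinned allocations; older toolkits
    // need this before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Two separate pinned + mapped regions of system memory.
    unsigned char *h_src = nullptr, *h_dst = nullptr;
    CHECK(cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocMapped));
    memset(h_src, 0xAB, bytes);  // assumed fill pattern
    memset(h_dst, 0x00, bytes);

    // Device-side aliases of the same system memory.
    void *d_src = nullptr, *d_dst = nullptr;
    CHECK(cudaHostGetDevicePointer(&d_src, h_src, 0));
    CHECK(cudaHostGetDevicePointer(&d_dst, h_dst, 0));

    // The copy as issued in the question: "device to device",
    // although both regions actually live in host memory.
    CHECK(cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaDeviceSynchronize());  // be sure the copy has finished before reading

    printf("copy %s\n",
           memcmp(h_src, h_dst, bytes) == 0 ? "matches" : "differs");

    CHECK(cudaFreeHost(h_src));
    CHECK(cudaFreeHost(h_dst));
    return 0;
}
```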
With cudaMemcpy the CUDA driver detects that you are copying from a host pointer to a host pointer and the copy is done on the CPU. You can of course use memcpy on the CPU yourself if you prefer.
If you use cudaMemcpy, there may be an extra stream synchronize performed before doing the copy (which you may see in the profiler, but I'm guessing there—test and see).
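If you go the plain memcpy route, something along these lines should work. Treat it as a sketch under assumptions: the buffers here are std::vector allocations pinned with cudaHostRegister, the size is arbitrary, and the check() helper is made up for the example. The one thing to keep in mind is that memcpy does no synchronization with the GPU, so if kernels might still be touching the buffers you have to synchronize yourself first.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): abort on any runtime error.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const size_t bytes = 1 << 20;  // assumed buffer size (1 MiB)
    std::vector<unsigned char> src(bytes, 0x5A), dst(bytes, 0x00);

    // Pin and map the existing host allocations.
    check(cudaHostRegister(src.data(), bytes, cudaHostRegisterMapped),
          "cudaHostRegister(src)");
    check(cudaHostRegister(dst.data(), bytes, cudaHostRegisterMapped),
          "cudaHostRegister(dst)");

    // ... kernels that read or write the mapped buffers would run here ...

    // memcpy knows nothing about the GPU, so wait for outstanding
    // device work before touching the buffers on the CPU.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    // Both regions are ordinary system memory, so a CPU copy is enough.
    memcpy(dst.data(), src.data(), bytes);

    printf("copy %s\n",
           memcmp(src.data(), dst.data(), bytes) == 0 ? "matches" : "differs");

    check(cudaHostUnregister(src.data()), "cudaHostUnregister(src)");
    check(cudaHostUnregister(dst.data()), "cudaHostUnregister(dst)");
    return 0;
}
```

Whether this is measurably faster than cudaMemcpy on the same pointers is something to profile on your own system rather than assume.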
On a UVA system you can just use cudaMemcpyDefault, as talonmies says in his answer. But if you don't have UVA (which requires sm_20+ and a 64-bit OS), then you have to call the right copy (e.g. cudaMemcpyDeviceToDevice). If you cudaHostRegister() everything you are interested in, then cudaMemcpyDeviceToDevice will end up doing the following depending on where the memory is located: