I want to copy data from GPU0-DDR to GPU1-DDR directly without CPU-RAM.
As said here on the page-15: http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf
Peer-to-Peer Memcpy
Direct copy from pointer on GPU A to pointer on GPU B
With UVA, just use cudaMemcpy(…, cudaMemcpyDefault)
Or cudaMemcpyAsync(…, cudaMemcpyDefault)
Also non-UVA explicit P2P copies:
cudaError_t cudaMemcpyPeer( void * dst, int dstDevice, const void* src,
int srcDevice, size_t count )
cudaError_t cudaMemcpyPeerAsync( void * dst, int dstDevice,
const void* src, int srcDevice, size_t count, cuda_stream_t stream = 0 )
cudaMemcpy()
then do I must at first to set a flag cudaSetDeviceFlags( cudaDeviceMapHost )
? cudaMemcpy()
pointers which I got as result from the function cudaHostGetDevicePointer(& uva_ptr, ptr, 0)
? cudaMemcpyPeer()
, and if no any advantage, why it is needed?Unified Virtual Addressing (UVA) enables one address space for all CPU and GPU memories since it allows determining physical memory location from pointer value.
Peer-to-peer memcpy with UVA*
When UVA is possible, then cudaMemcpy
can be used for peer-to-peer memcpy
since CUDA can infer which device "owns" which memory. The instructions you typically need to perform a peer-to-peer memcpy
with UVA are the following:
//Check for peer access between participating GPUs:
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);
//Enable peer access between participating GPUs:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//UVA memory copy:
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
Peer-to-peer memcpy without UVA
When UVA is not possible, then peer-to-peer memcpy is done via cudaMemcpyPeer
. Here is an example
// Set device 0 as current
cudaSetDevice(0);
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size);
// Set device 1 as current
cudaSetDevice(1);
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size);
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0);
// Set device 1 as current
cudaSetDevice(1);
// Copy p0 to p1
cudaMemcpyPeer(p1, 1, p0, 0, size);
// Launch kernel on device 1
MyKernel<<<1000, 128>>>(p1);
As you can see, while in the former case (UVA possible) you don't need to specify which device the different pointers refer to, in the latter case (UVA not possible) you have to explicitly mention which device the pointers refer to.
The instruction
cudaSetDeviceFlags(cudaDeviceMapHost);
is used to enable host mapping to device memory, which is a different thing and regards host<->device memory movements and not peer-to-peer memory movements, which is the topic of your post.
In conclusion, the answer to your questions are:
cudaMemcpy
(you don't need to specify the devices); otherwise, use cudaMemcpyPeer
(and you need to specify the devices).If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With