I am using 4 GPUs, and to speed up memory transfers I am trying to use pinned memory allocated with cudaHostAlloc(). The main UI thread (MFC based) creates 4 branch threads, and each thread calls cudaSetDevice(nDeviceID).
Here is my question: can I call cudaHostAlloc() in the main thread and pass the pointer through lParam, or do I have to call it in each branch thread after calling cudaSetDevice(nDeviceID)?
Here is the pseudocode.
1) Calling cudaHostAlloc in the main thread
Main thread
    float* h_arrfBuf = nullptr;
    cudaHostAlloc((void**)&h_arrfBuf, size * sizeof(float), cudaHostAllocDefault);
    AcqBuf(h_arrfBuf, size);
    for (int i = 1; i <= 4; ++i) {
        ST_Param* pstParam = new ST_Param(i, size / 4, h_arrfBuf);
        AfxBeginThread(Calc, pstParam);
    }
Branch thread
    UINT Calc(LPVOID lParam)
    {
        ST_Param* pstParam = reinterpret_cast<ST_Param*>(lParam);
        cudaSetDevice(pstParam->nDeviceID);
        Cudafunc(pstParam->size, pstParam->h_arrfBuf + (pstParam->nDeviceID - 1) * pstParam->size);
    }
2) Calling cudaHostAlloc in the branch threads
Main thread
    AcqBuf(arrfRaw, size);
    for (int i = 1; i <= 4; ++i) {
        ST_Param* pstParam = new ST_Param(i, size / 4, arrfRaw + (i - 1) * size / 4);
        AfxBeginThread(Calc, pstParam);
    }
Branch thread
    UINT Calc(LPVOID lParam)
    {
        ST_Param* pstParam = reinterpret_cast<ST_Param*>(lParam);
        cudaSetDevice(pstParam->nDeviceID);
        float* h_arrfBuf = nullptr;
        cudaHostAlloc((void**)&h_arrfBuf, pstParam->size * sizeof(float), cudaHostAllocDefault);
        memcpy(h_arrfBuf, pstParam->arrfRaw, pstParam->size * sizeof(float));
        Cudafunc(pstParam->size, h_arrfBuf);
    }
What I am basically curious about is whether pinned memory is device-specific or not.
Since CUDA 4.0, the runtime API has been intrinsically thread-safe, and a context on any given GPU is automatically shared among all host threads within a given application (see here).
Further, quoting from the relevant documentation:
When the application is run as a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. As a consequence:
....
- Allocations via cudaHostAlloc() are automatically portable (see Portable Memory) across all the devices for which the unified address space is used, and pointers returned by cudaHostAlloc() can be used directly from within kernels running on these devices (i.e., there is no need to obtain a device pointer via cudaHostGetDevicePointer() as described in Mapped Memory).
So if your GPUs and platform support unified virtual addressing, pinned/mapped host memory is automatically portable to all devices within that address space, and each GPU's context is automatically shared across all host threads. You should therefore be safe doing the complete pinned memory setup from a single host thread, given all the constraints described above.