
pinned memory in multiple GPUs

I am using 4 GPUs, and to speed up memory transfers I am trying to use pinned memory allocated with cudaHostAlloc().

The main UI thread (MFC-based) creates 4 worker threads, and each thread calls cudaSetDevice(nDeviceID).

Here is my question: can I call cudaHostAlloc() on the main thread and pass the pointer to each worker thread as lParam, or do I have to call it in each worker thread after calling cudaSetDevice(nDeviceID)?

Here is the pseudocode.

1) Calling cudaHostAlloc on the main thread

Main thread

cudaHostAlloc((void**)&h_arrfBuf, size*sizeof(float), cudaHostAllocDefault);
AcqBuf(h_arrfBuf, size);
for i = 1:4
    ST_Param* pstParam = new ST_Param(i, size/4, h_arrfBuf);
    AfxBeginThread(Calc, pstParam);

Branch thread

UINT Calc(LPVOID lParam)
    ST_Param* pstParam = reinterpret_cast<ST_Param*>(lParam);
    cudaSetDevice(pstParam->nDeviceID);
    Cudafunc(pstParam->size, pstParam->h_arrfBuf + (pstParam->nDeviceID-1)*pstParam->size);

2) Calling cudaHostAlloc in the branch threads

Main thread

AcqBuf(arrfRaw, size);
for i = 1:4
    ST_Param* pstParam = new ST_Param(i, size/4, arrfRaw + (i-1)*size/4);
    AfxBeginThread(Calc, pstParam);

Branch thread

UINT Calc(LPVOID lParam)
    ST_Param* pstParam = reinterpret_cast<ST_Param*>(lParam);
    cudaSetDevice(pstParam->nDeviceID);
    cudaHostAlloc((void**)&h_arrfBuf, pstParam->size*sizeof(float), cudaHostAllocDefault);
    memcpy(h_arrfBuf, pstParam->arrfRaw, pstParam->size*sizeof(float));
    Cudafunc(pstParam->size, h_arrfBuf);

What I am basically curious about is whether pinned memory is device-specific or not.

MINSUK LE asked Mar 09 '26 05:03
1 Answer

Since CUDA 4.0, the runtime API has been intrinsically thread-safe, and a context on any given GPU is automatically shared amongst every host thread within a given application (see here).

Further, quoting from the relevant documentation:

When the application is run as a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. As a consequence:

....

  • Allocations via cudaHostAlloc() are automatically portable (see Portable Memory) across all the devices for which the unified address space is used, and pointers returned by cudaHostAlloc() can be used directly from within kernels running on these devices (i.e., there is no need to obtain a device pointer via cudaHostGetDevicePointer() as described in Mapped Memory).

So if your GPUs and platform support unified virtual addressing, then pinned/mapped host memory is automatically portable to every device within that address space, and each GPU's context is automatically shared across your host threads. You should therefore be safe doing the complete pinned memory setup from a single host thread, as in your first version, given all the constraints described above.
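As a minimal sketch of that first approach in plain CUDA C++ (not the asker's actual MFC code: the buffer size, the std::thread workers, and the device-side copy are placeholders standing in for AcqBuf/Cudafunc), one pinned allocation is made on the main thread, the unifiedAddressing device property is checked, and each worker thread then uses its slice of the same pointer:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

// Sketch only: assumes a 64-bit process, CUDA >= 4.0, and devices
// that report unifiedAddressing == 1.
int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    // Verify every device participates in the unified address space.
    for (int d = 0; d < nDevices; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (!prop.unifiedAddressing) {
            std::fprintf(stderr, "device %d lacks UVA\n", d);
            return 1;
        }
    }

    const size_t size  = 1 << 20;          // total floats (placeholder)
    const size_t chunk = size / nDevices;  // floats per device

    // One pinned allocation, made once on the main thread. Under UVA
    // it is automatically portable to all devices and host threads.
    float* h_arrfBuf = nullptr;
    cudaHostAlloc((void**)&h_arrfBuf, size * sizeof(float),
                  cudaHostAllocDefault);

    std::vector<std::thread> workers;
    for (int d = 0; d < nDevices; ++d) {
        workers.emplace_back([=] {
            cudaSetDevice(d);              // select this thread's GPU
            float* d_buf = nullptr;
            cudaMalloc(&d_buf, chunk * sizeof(float));
            // The pinned pointer is valid here even though it was
            // allocated on a different host thread.
            cudaMemcpy(d_buf, h_arrfBuf + d * chunk,
                       chunk * sizeof(float), cudaMemcpyHostToDevice);
            cudaFree(d_buf);
        });
    }
    for (auto& t : workers) t.join();

    cudaFreeHost(h_arrfBuf);
    return 0;
}
```

This mirrors version 1) of the question: no per-thread cudaHostAlloc and no extra memcpy into a per-device staging buffer are needed.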

talonmies answered Mar 12 '26 01:03

