I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is larger than the amount of memory available on the GPU.
Can I cudaMallocHost more space than there is on the GPU? If not, and let's say I allocate 1/4 of the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?
I would essentially still have to copy from that 1/4-sized buffer into my full-size malloc'd buffer, and that's probably no faster than just using normal cudaMalloc, right?
Is this typical usage scenario correct for using cudaMallocHost:

1. allocate pinned host memory with cudaMallocHost (call it h_p)
2. populate h_p with input data
3. get a device pointer for h_p
4. run the GPU kernel with that device pointer, modifying the contents of h_p
5. read the modified results back out of h_p on the host
So - no copy has to happen between step 4 and 5, right?
If that is correct, then I can at least see the advantage for kernels whose data fits on the GPU all at once.
Memory transfer is an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:

- allocate pinned (page-locked) host memory: if host memory allocated this way is involved in a cudaMemcpy as either source or destination, the CUDA runtime will be able to perform an optimized memory transfer.
- allocate mapped memory: this is also page-locked, but it is mapped into the CUDA address space and can be accessed directly from kernel code. To use it you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before using any other CUDA function. The GPU memory size does not limit the size of mapped host memory.

I'm not sure about the performance of the latter technique. It could allow you to overlap computation and communication very nicely.
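For the first of those two cases (pinned memory used with explicit copies), a minimal sketch might look like this; the buffer names and size are made up for illustration and error checking is omitted:

```
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 1 << 24;               // arbitrary example size
    const size_t bytes = N * sizeof(float);

    float *h_p = NULL;                      // pinned (page-locked) host buffer
    float *d_p = NULL;                      // device buffer

    // cudaMallocHost gives page-locked host memory; cudaMemcpy to/from it
    // can be faster than a copy from ordinary pageable malloc'd memory.
    cudaMallocHost((void **)&h_p, bytes);
    cudaMalloc((void **)&d_p, bytes);

    for (size_t i = 0; i < N; ++i)          // fill input on the host
        h_p[i] = (float)i;

    // This is the transfer that benefits from the pinned allocation.
    cudaMemcpy(d_p, h_p, bytes, cudaMemcpyHostToDevice);

    // ... launch kernels on d_p here ...

    cudaMemcpy(h_p, d_p, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_p);
    cudaFreeHost(h_p);
    return 0;
}
```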
If you access the memory in blocks inside your kernel (i.e. you don't need the entire data but only a section) you could use a multi-buffering method utilizing asynchronous memory transfers with cudaMemcpyAsync, by having multiple buffers on the GPU: compute on one buffer, transfer one buffer to the host and transfer one buffer to the device at the same time.
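As a sketch of that multi-buffering idea (the process_chunk kernel, names and sizes are placeholders I made up): the input is processed in chunks, and because consecutive chunks alternate between two streams, the copies and the compute of different chunks can overlap. Note that cudaMemcpyAsync only overlaps with other work when the host buffer is pinned:

```
#include <cuda_runtime.h>

// Placeholder kernel: processes one chunk of data in place.
__global__ void process_chunk(float *d_buf, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        d_buf[i] *= 2.0f;                 // stand-in for real work
}

int main(void)
{
    const size_t total = 1 << 26;         // more data than we want on the GPU at once
    const size_t chunk = 1 << 22;         // elements per chunk
    const int    nbufs = 2;               // double buffering

    float *h_data = NULL;                 // pinned host buffer holding ALL the data
    cudaMallocHost((void **)&h_data, total * sizeof(float));
    for (size_t i = 0; i < total; ++i)
        h_data[i] = 1.0f;                 // example input

    float       *d_buf[nbufs];
    cudaStream_t stream[nbufs];
    for (int b = 0; b < nbufs; ++b) {
        cudaMalloc((void **)&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (size_t offset = 0; offset < total; offset += chunk) {
        int    b = (int)((offset / chunk) % nbufs);   // which buffer/stream to use
        size_t n = (total - offset < chunk) ? (total - offset) : chunk;

        // Copy in, compute, copy out -- all queued in the same stream,
        // so work queued in the two streams can overlap.
        cudaMemcpyAsync(d_buf[b], h_data + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process_chunk<<<(unsigned)((n + 255) / 256), 256, 0, stream[b]>>>(d_buf[b], n);
        cudaMemcpyAsync(h_data + offset, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();              // wait for all chunks to finish

    for (int b = 0; b < nbufs; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFree(d_buf[b]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```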
I believe your assertions about the usage scenario are correct when using the cudaDeviceMapHost type of allocation. You do not have to do an explicit copy, but there certainly will be an implicit copy that you don't see. There's a chance it overlaps nicely with your computation. Note that you might need to synchronize after the kernel call to make sure the kernel has finished and that you have the modified content in h_p.
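To make that concrete, here is a minimal sketch of the mapped (zero-copy) scenario, roughly following your numbered steps; scale_kernel and the size N are made up for illustration and error checking is omitted:

```
#include <cuda_runtime.h>
#include <stdio.h>

// Placeholder kernel that modifies the mapped host memory through d_p.
__global__ void scale_kernel(float *d_p, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        d_p[i] *= 2.0f;
}

int main(void)
{
    const size_t N = 1 << 20;

    // Must be set before any other CUDA call that creates a context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // 1. Allocate mapped, page-locked host memory.
    float *h_p = NULL;
    cudaHostAlloc((void **)&h_p, N * sizeof(float), cudaHostAllocMapped);

    // 2. Populate it with input data on the host.
    for (size_t i = 0; i < N; ++i)
        h_p[i] = (float)i;

    // 3. Get the device-side pointer that aliases h_p.
    float *d_p = NULL;
    cudaHostGetDevicePointer((void **)&d_p, h_p, 0);

    // 4. Run the kernel on the mapped memory (no cudaMemcpy anywhere).
    scale_kernel<<<(unsigned)((N + 255) / 256), 256>>>(d_p, N);

    // 5. Synchronize before the host reads the results out of h_p.
    cudaDeviceSynchronize();
    printf("h_p[100] = %f\n", h_p[100]);

    cudaFreeHost(h_p);
    return 0;
}
```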