
Default Pinned Memory Vs Zero-Copy Memory

Tags:

cuda

In CUDA we can use pinned memory to copy data from host to GPU more efficiently than with the default memory allocated via malloc on the host. However, there are two types of pinned memory: default pinned memory and zero-copy pinned memory.

Default pinned memory copies data from host to GPU about twice as fast as normal pageable transfers, so there's definitely an advantage (provided we have enough host memory to page-lock).

With the other variant of pinned memory, i.e. zero-copy memory, we avoid copying the data from the host to the GPU's DRAM altogether; the kernels read the data directly from host memory.

My question is: which of these pinned-memory types is the better programming practice?
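For context, a minimal sketch of the two allocation paths (untested, error checking omitted; the kernel name `scale` is just an illustration):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped host memory (before first CUDA call)

    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Default pinned memory: page-locked, but still copied explicitly to device DRAM
    float *h_pinned, *d_data;
    cudaMallocHost((void **)&h_pinned, bytes);                    // page-locked host buffer
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);  // fast DMA transfer
    scale<<<(N + 255) / 256, 256>>>(d_data, N);                   // kernel reads device DRAM

    // Zero-copy (mapped) pinned memory: no explicit copy at all
    float *h_mapped, *d_mapped;
    cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);    // device-side alias
    scale<<<(N + 255) / 256, 256>>>(d_mapped, N);                 // kernel reads host RAM over the bus
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    return 0;
}
```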

asked Mar 06 '11 by jwdmsd



2 Answers

I think it depends on your application (otherwise, why would they provide both ways?)

Mapped, pinned memory (zero-copy) is useful when either:

  • The GPU has no memory of its own and uses system RAM anyway

  • You load the data exactly once, but you have a lot of computation to perform on it and you want to hide memory-transfer latency behind the computation.

  • The host side wants to change/add more data, or read the results, while the kernel is still running (e.g. communication)

  • The data does not fit into GPU memory

Note that you can also use multiple streams to copy data and run kernels concurrently.
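The multiple-streams point can be sketched roughly like this (untested; `process` is a hypothetical kernel). Pinned host buffers are required for `cudaMemcpyAsync` to actually overlap with kernel execution:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost((void **)&h_buf[s], bytes);  // pinned: needed for async overlap
        cudaMalloc((void **)&d_buf[s], bytes);
        cudaStreamCreate(&stream[s]);
    }

    for (int s = 0; s < 2; ++s) {
        // the copy in one stream can overlap with the kernel running in the other
        cudaMemcpyAsync(d_buf[s], h_buf[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(N + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], N);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFreeHost(h_buf[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```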

Pinned, but not mapped memory is better:

  • When you load or store the data multiple times. For example, you have multiple subsequent kernels performing the work in steps; there is no need to load the data from the host every time.

  • When there is not much computation to perform and loading latencies are not going to be hidden well

answered Oct 05 '22 by CygnusX1


Mapped pinned memory is identical to other types of pinned memory in all respects, except that it is mapped into the CUDA address space, so it can be read and written by CUDA kernels as well as used for DMA transfers by the copy engines.

The advantage of not mapping pinned memory was twofold: it saved you some address space, which can be a precious commodity in a world of 32-bit platforms with GPUs that can hold 3-4G of RAM. Also, memory that is not mapped cannot be accidentally corrupted by rogue kernels. But that concern is esoteric enough that the unified address space feature in CUDA 4.0 will cause all pinned allocations to be mapped by default.

Besides the points raised by the Sanders/Kandrot book, other things to keep in mind:

  • writing to host memory from a kernel (e.g. to post results to the CPU) is nice in that the GPU does not have any latency to cover in that case, and

  • it is VERY IMPORTANT that the memory operations be coalesced - otherwise, even SM 2.x and later GPUs take a big bandwidth hit.
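A sketch of the coalescing point (kernel name hypothetical): adjacent threads should touch adjacent elements of the mapped buffer, so each warp's accesses collapse into a few wide bus transactions instead of many small ones:

```cuda
// Posts results directly to mapped (zero-copy) host memory.
// host_out is the device pointer obtained via cudaHostGetDevicePointer().
// Thread i writes element i, so consecutive threads in a warp write
// consecutive addresses and the stores coalesce.
__global__ void post_results(float *host_out, const float *d_in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) host_out[i] = d_in[i];
}
```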

answered Oct 05 '22 by ArchaeaSoftware