Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is CUDA pinned memory so fast?

Tags:

c++

c

linux

cuda

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On linux, the underlying system call for achieving this is mlock. From the man page of mlock, it states that locking the page prevents it from being swapped out:

mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;

In my tests, I had a fews gigs of free memory on my system so there was never any risk that the memory pages could've been swapped out yet I still observed the speedup. Can anyone explain what's really going on here?, any insight or info is much appreciated.

like image 595
Gearoid Murphy Avatar asked Apr 20 '11 21:04

Gearoid Murphy


People also ask

What is Cuda pinned memory?

Pinned Host Memory As you can see in the figure, pinned memory is used as a staging area for transfers from the device to the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory.

What does pinned memory mean?

Pinned memory is virtual memory pages that are specially marked so that they cannot be paged out. They are allocated with special system API function calls. The important point for us is that CPU memory that serves as the source of destination of a DMA transfer must be allocated as pinned memory.

How does Cuda allocate memory?

Memory management on a CUDA device is similar to how it is done in CPU programming. You need to allocate memory space on the host, transfer the data to the device using the built-in API, retrieve the data (transfer the data back to the host), and finally free the allocated memory.

What is page locked memory?

With paged memory, the specific memory, which is allowed to be paged in or paged out, is called pageable memory. Conversely, the specific memory, which is not allowed to be paged in or paged out, is called page-locked memory or pinned memory. Page-locked memory will not communicate with hard drive.

Why is Cuda memory transfer so slow?

The data at this memory location is usable by the host. To transfer this data to the device, the CUDA run time copies this memory to a temporary pinned memory and then transfers to the device memory. Hence, there are two memory transfers. Therefore, this type of memory allocation and transfer is slow.

How does pinning affect performance in CUDA?

“You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs.” but what about any effects in CUDA land? The ramifications of pinning are almost entirely on the host side.

What is the difference between pinned and mapped memory in CUDA?

The above numbers are obtained by profiling the compiled CUDA code with NVIDIA NSIGHT Systems profiler. Compared to pageable memory, pinned memory has only 1 memory transfer. Hence memory transfer time is less for pinned memory than pageable memory. In mapped memory, the address is mapped to the device address space.

How does CUDA transfer data from host to device?

The memory allocated in host is by default pageable memory. The data at this memory location is usable by the host. To transfer this data to the device, the CUDA run time copies this memory to a temporary pinned memory and then transfers to the device memory.


1 Answers

CUDA Driver checks, if the memory range is locked or not and then it will use a different codepath. Locked memory is stored in the physical memory (RAM), so device can fetch it w/o help from CPU (DMA, aka Async copy; device only need list of physical pages). Not-locked memory can generate a page fault on access, and it is stored not only in memory (e.g. it can be in swap), so driver need to access every page of non-locked memory, copy it into pinned buffer and pass it to DMA (Syncronious, page-by-page copy).

As described here http://forums.nvidia.com/index.php?showtopic=164661

host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

I can also recommend to check cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. HostAlloc says that cuda driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

like image 61
osgx Avatar answered Sep 23 '22 02:09

osgx