 

Techniques to Reduce CPU to GPU Data Transfer Latency

I've been looking into ways to reduce the latency caused by transferring data back and forth between the CPU and GPU. When I first started using CUDA I noticed that data transfer between the CPU and GPU took a few seconds, but I didn't really care because this isn't a concern for the small programs I've been writing. In fact, the latency probably isn't much of a problem for the vast majority of programs that utilize GPUs, video games included, because they're still a lot faster than if they had run on the CPU.

However, I'm a bit of an HPC enthusiast, and I became concerned with the direction of my studies when I saw the massive discrepancy between Tianhe-I's theoretical peak FLOPS and its actual LINPACK-measured performance. It has made me wonder whether I'm taking the right career path.

Use of pinned (page-locked) memory through the cudaHostAlloc() function is one method of reducing latency (quite effective), but are there any other techniques I'm not aware of? And to be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA's and AMD's job).
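For reference, this is roughly how I've been allocating pinned memory so far (the sizes and names are just illustrative):

    #include <cuda_runtime.h>

    int main() {
        const size_t N = 1 << 20;                // 1M floats, just an example size
        float *h_pinned = NULL, *d_data = NULL;

        // Page-locked host allocation instead of a plain malloc()
        cudaHostAlloc((void**)&h_pinned, N * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void**)&d_data, N * sizeof(float));

        for (size_t i = 0; i < N; ++i) h_pinned[i] = 1.0f;

        // Transfers from pinned memory use DMA and reach higher bandwidth
        cudaMemcpy(d_data, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

        // ... kernel launches would go here ...

        cudaMemcpy(h_pinned, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        cudaFreeHost(h_pinned);
        return 0;
    }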

Just as a side question, I'm aware that Dell and HP sell Tesla servers. I'm curious how well a GPU handles a database application, where you would need constant reads from the drive (HDD or SSD), an operation only the CPU can perform.

asked Jun 28 '11 by sj755



1 Answer

There are a few ways to address CPU-GPU communication overhead - I hope that's what you mean by latency and not the latency of the transfer itself. Note that I deliberately used the term address instead of reduce as you do not necessarily need to reduce the latency if you can hide it. Also note that I am much more familiar with CUDA, so below I only refer to CUDA, but some features are also available in OpenCL.

As you mentioned, page-locked memory serves the very purpose of increasing transfer bandwidth between host and device. Additionally, page-locked host memory can be mapped into the GPU's address space, a mechanism which enables the GPU kernel to access the host-allocated data directly, without an additional explicit transfer. This mechanism is called zero-copy transfer, and it is useful when the data is read/written only once and is accompanied by a substantial amount of computation, and for GPUs with no separate memory (mobile). However, if the kernel accessing the zero-copied data is not strongly compute-bound, so the latency of the data accesses cannot be hidden, page-locked but unmapped memory will be more efficient. Additionally, zero-copy still works if the data does not fit into the GPU memory.
Note that an excessive amount of page-locked memory can cause serious slowdown on the CPU side.
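Roughly, the mapped (zero-copy) setup looks like this, assuming a device that supports host memory mapping (the scale() kernel and the sizes are just placeholders):

    #include <cuda_runtime.h>

    // Placeholder kernel: its reads and writes go straight over PCIe to host memory
    __global__ void scale(float *data, size_t n, float factor) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const size_t N = 1 << 20;
        float *h_mapped = NULL, *d_alias = NULL;

        // Must be set before the CUDA context is created
        cudaSetDeviceFlags(cudaDeviceMapHost);

        // Page-locked allocation that is also mapped into the device address space
        cudaHostAlloc((void**)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&d_alias, h_mapped, 0);

        for (size_t i = 0; i < N; ++i) h_mapped[i] = 1.0f;

        // No cudaMemcpy: the kernel accesses the host allocation through d_alias
        scale<<<(N + 255) / 256, 256>>>(d_alias, N, 2.0f);
        cudaDeviceSynchronize();                 // results are now visible in h_mapped

        cudaFreeHost(h_mapped);
        return 0;
    }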

Approaching the problem from a different angle, as tkerwin mentioned, asynchronous transfer (with respect to the CPU thread talking to the GPU) is the key to hiding CPU-GPU transfer latency by overlapping computation on the CPU with the transfer. This can be achieved with cudaMemcpyAsync() as well as by using zero-copy with asynchronous kernel execution.
One can take this even further by using multiple streams to overlap transfers with kernel execution. Note that stream scheduling might need special attention for good overlapping; Tesla and Quadro cards have dual DMA engines which enable simultaneous data transfer to and from the GPU. Additionally, with CUDA 4.0 it became easier to use a single GPU from multiple CPU threads, so in a multi-threaded CPU code each thread can send its own data to the GPU and launch kernels more easily.
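A rough sketch of the multi-stream overlap, assuming the data can be processed in independent chunks (the process() kernel, the chunk size, and the stream count are just illustrative):

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real per-chunk computation
    __global__ void process(float *data, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        const int nStreams = 4;
        const size_t N = 1 << 22;
        const size_t chunk = N / nStreams;

        float *h_data = NULL, *d_data = NULL;
        cudaStream_t streams[nStreams];

        // Asynchronous copies require page-locked host memory
        cudaHostAlloc((void**)&h_data, N * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void**)&d_data, N * sizeof(float));
        for (size_t i = 0; i < N; ++i) h_data[i] = 1.0f;
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nStreams; ++s) {
            size_t off = s * chunk;
            // The copy for chunk s can overlap with the kernel working on chunk s-1
            cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
            cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();                  // wait for all streams to finish

        for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }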

Finally, GMAC implements an asymmetric shared memory model for CUDA. One of its most interesting features is the set of coherency models it provides, in particular lazy and rolling updates, which enable transferring only the data modified on the CPU, and doing so in a blocked fashion.
For more details see the following paper: Gelado et al. - An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems.

answered by pszilard