 

The cost of CUDA global memory transactions

Tags:

cuda

According to the CUDA 5.0 Programming Guide, if I am using both L1 and L2 caching (on Fermi or Kepler), all global memory operations are done using 128-byte memory transactions. However, if I am using L2 only, 32-byte memory transactions are used (chapter F.4.2).

Let us assume that all caches are empty. If I have a warp, with each thread accessing a single 4-byte word in a perfectly aligned fashion, this will result in 1x128B transaction in the L1+L2 case, and in 4x32B transactions in the L2-only case. Is that right?

My question is - are the 4 32B transactions any slower than a single 128B transaction? My intuition from pre-Fermi hardware suggests that it would be slower, but perhaps this is no longer true on the newer hardware? Or maybe I should just look at the amount of bandwidth utilization to judge the efficiency of my memory access?
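To make the pattern concrete, the transaction counts can be modeled with a few lines of arithmetic (this is just a sketch that counts aligned segments touched by a warp's addresses, not CUDA code; the 128B and 32B segment sizes are the ones from the Programming Guide quote above):

```python
# Model of memory transactions for one warp's global load: count how many
# aligned segments (128B L1 lines or 32B L2 segments) the request touches.

def segments_touched(addresses, access_bytes, segment_size):
    """Return the number of aligned segments covering all accessed bytes."""
    segs = set()
    for addr in addresses:
        for byte in range(addr, addr + access_bytes):
            segs.add(byte // segment_size)
    return len(segs)

# A warp of 32 threads, each loading a 4-byte word, perfectly aligned:
warp_addresses = [4 * lane for lane in range(32)]

l1_transactions = segments_touched(warp_addresses, 4, 128)  # L1+L2 path
l2_transactions = segments_touched(warp_addresses, 4, 32)   # L2-only path

print(l1_transactions, l2_transactions)  # 1 4
```

This confirms the 1x128B vs. 4x32B split for the fully aligned case.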

CygnusX1 asked Oct 09 '12 10:10




1 Answer

Yes. In caching mode, a single 128-byte transaction is generated (as seen from the L1 cache level). In uncached mode, four 32-byte transactions are generated (as seen from the L2 cache level; it is still a single 128-byte request coming from the warp, due to coalescing).

For the fully coalesced access you describe, the four 32-byte transactions are not any slower than the single 128-byte transaction, regardless of cached or uncached mode. The memory controller (on a given GPU) should generate the same transactions to satisfy the warp's request in either case. The memory controller is composed of a number (up to 6) of "partitions", each of which has a 64-bit-wide path, so ultimately multiple memory transactions (perhaps spread across multiple partitions) will be used to satisfy either request (4x32-byte or 1x128-byte). The specific number of transactions and their organization across partitions may vary from GPU to GPU. It isn't part of your question, but a GPU with DDR-pumped memory will return 16 bytes per partition per memory transaction, and one with QDR-pumped memory will return 32 bytes per partition per memory transaction.

None of this is specific to CUDA 5, either. You might want to review one of NVIDIA's webinars on this material, in particular "CUDA Optimization: Memory Bandwidth Limited Kernels". Even if you don't want to watch the video, a quick review of the slides will remind you of the various differences between so-called "cached" and "uncached" accesses (this refers to L1), and will also give you the compiler switches needed to try each case.
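The claim that both modes produce the same DRAM traffic can be sanity-checked with simple arithmetic, using the figures quoted above (a sketch only; the 64-bit partition path and the DDR/QDR pump factors are the numbers from this answer, and actual transaction organization varies by GPU):

```python
# Back-of-the-envelope model of DRAM-level transactions per partition,
# using the figures above: a 64-bit (8-byte) path per partition, returning
# 16B per transaction with DDR-pumped memory and 32B with QDR.

import math

PARTITION_PATH_BYTES = 8  # 64-bit-wide path per partition

def dram_transactions(request_bytes, pump_factor):
    """Transactions needed to move request_bytes at the given pump factor."""
    bytes_per_txn = PARTITION_PATH_BYTES * pump_factor  # 16B (DDR), 32B (QDR)
    return math.ceil(request_bytes / bytes_per_txn)

# The warp needs 128 bytes either way: as 1x128B or as 4x32B requests.
one_128B_qdr = dram_transactions(128, pump_factor=4)     # cached mode, QDR
four_32B_qdr = 4 * dram_transactions(32, pump_factor=4)  # uncached mode, QDR

print(one_128B_qdr, four_32B_qdr)  # 4 4 -- same DRAM traffic in both modes
```

Either way the memory system ends up moving the same 128 bytes in the same number of DRAM transactions, which is why neither mode is slower for a fully coalesced access.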

Another reason to review the slides is that they cover the circumstances under which you might want to try "uncached" mode. In particular, if your warps produce a scattered (uncoalesced) access pattern, uncached access may yield an improvement, because there is less "wastage" when requesting 32-byte quantities from memory to satisfy a single thread's request than when requesting 128-byte quantities. However, in response to your final question, it's fairly difficult to be analytical about this, because presumably your code is a mix of ordered and disordered access patterns. Since uncached mode is turned on via a compiler switch, the suggestion given in the slides is simply to try your code both ways and see which runs faster. In my experience, running in uncached mode rarely yields a performance improvement.
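The "wastage" argument can also be made concrete. Here is a sketch of a hypothetical fully scattered pattern, where each thread of a warp reads one 4-byte word from a different 128-byte line (the 256-byte stride is made up purely for illustration):

```python
# Model of total bytes fetched from memory for a scattered warp access:
# each of 32 threads reads one 4-byte word from a different 128-byte line.

def bytes_fetched(addresses, access_bytes, segment_size):
    """Total bytes moved: touched aligned segments times the segment size."""
    segs = {byte // segment_size
            for addr in addresses
            for byte in range(addr, addr + access_bytes)}
    return len(segs) * segment_size

scattered = [256 * lane for lane in range(32)]  # hypothetical 256B stride

useful = 32 * 4                                 # 128 bytes actually needed
cached = bytes_fetched(scattered, 4, 128)       # 128B line granularity
uncached = bytes_fetched(scattered, 4, 32)      # 32B segment granularity

print(useful, cached, uncached)  # 128 4096 1024
```

For the same 128 useful bytes, cached mode moves 4096 bytes while uncached mode moves 1024, which is the potential win for scattered patterns. The switch itself is `-Xptxas -dlcm=cg` to compile global loads for the uncached (L2-only) path, versus the default `-Xptxas -dlcm=ca` for L1+L2, which is why the slides suggest simply building both ways and timing.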

EDIT: Sorry I had the link and title for the wrong presentation. Fixed slide/video link and webinar title.

Robert Crovella answered Sep 18 '22 13:09