
reaching theoretical GPU global memory bandwidth

PREAMBLE: Assume I use an NVIDIA GTX480 card in CUDA. The theoretical peak global memory bandwidth for this card is 177.4 GB/s: 384 bits * 2 * 1848 MHz / 8 bits per byte = 177,408 MB/s = 177.4 GB/s

The 384 comes from the memory interface width (in bits), the 2 from the DDR nature of the memory, 1848 is the memory clock frequency (in MHz), and the 8 comes from the fact that I want my answer in bytes.
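As a minimal sketch, the same arithmetic can be applied to whatever device is installed by querying the memory clock and bus width at run time. The helper name `peak_gbytes_per_s` is made up for this example, and the factor 2 is simply the DDR assumption from the formula above; it may not hold for other memory types.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Peak bandwidth in GB/s (1 GB = 1e9 bytes), following the formula above:
// memory clock * transfers per clock * bus width in bytes.
// Hypothetical helper, not part of the CUDA API.
static double peak_gbytes_per_s(int mem_clock_khz, int bus_width_bits, int transfers_per_clock)
{
    return (double)mem_clock_khz * 1.0e3 * transfers_per_clock * (bus_width_bits / 8.0) / 1.0e9;
}

int main()
{
    int dev = 0, clock_khz = 0, bus_bits = 0;
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, dev) != cudaSuccess ||
        cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, dev) != cudaSuccess) {
        fprintf(stderr, "could not query the CUDA device\n");
        return 1;
    }
    // Assumption: 2 transfers per clock (DDR), as in the question.
    printf("this device:     %.1f GB/s\n", peak_gbytes_per_s(clock_khz, bus_bits, 2));
    // The numbers from the question: 1848 MHz, 384 bits, DDR.
    printf("GTX480 figures:  %.1f GB/s\n", peak_gbytes_per_s(1848000, 384, 2));  // 177.4
    return 0;
}
```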

Something similar can be computed for the shared memory: 4 bytes per bank * 32 banks * 0.5 banks per cycle * 1400MHz * 15 SMs = 1,344 GB/s

The number above factors in the number of SMs, that is, 15. Thus, to reach this max shared memory bandwidth I need to have all 15 SMs reading shared memory.

MY QUESTION: In order to reach the max global memory bandwidth, does it suffice to have only one SM read from global memory, or should all SMs attempt to read from global memory at the same time? More specifically, imagine I launch a kernel with one block with 32 threads. Then, if I have the one and only warp on SM-0, and all that I do in the kernel is read nonstop from global memory in a coalesced fashion, will I reach the 177.4 GB/s? Or should I launch at least 15 blocks, each with 32 threads, so that the 15 warps on SM-0 through SM-14 attempt to read at the same time?

The obvious next step would be to run a benchmark to find out, but I would also like to understand why the result is what it is.
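For reference, a benchmark along these lines might look like the sketch below. It launches one warp per block, as in the question, and doubles the number of blocks each round so the one-block case can be compared with one block per SM and beyond. The buffer size (256 MiB), the sweep up to 8 blocks per SM, and the names `read_kernel` and `gbytes_per_s` are choices made for this example, and error handling is kept minimal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread walks the buffer with a grid-stride loop, so the 32 threads of
// a warp read 32 consecutive 4-byte words per iteration (coalesced).
__global__ void read_kernel(const float* __restrict__ in, size_t n, float* out)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += in[i];
    // Store the sum so the compiler cannot drop the loads.
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes).
static double gbytes_per_s(size_t bytes, float ms)
{
    return (double)bytes / (ms * 1.0e-3) / 1.0e9;
}

int main()
{
    const size_t n = 64u << 20;   // 64 Mi floats = 256 MiB, assumed to fit on the device
    const int threads = 32;       // one warp per block, as in the question

    int dev = 0, num_sms = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
    const int max_blocks = 8 * num_sms;

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, (size_t)max_blocks * threads * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blocks = 1; blocks <= max_blocks; blocks *= 2) {
        read_kernel<<<blocks, threads>>>(in, n, out);   // warm-up launch
        cudaEventRecord(start);
        read_kernel<<<blocks, threads>>>(in, n, out);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d blocks x %d threads: %8.2f GB/s\n",
               blocks, threads, gbytes_per_s(n * sizeof(float), ms));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Comparing the first line of output (one warp on one SM) with the lines at and above the SM count would show directly how far a single SM gets from the 177.4 GB/s figure.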

user1586099 asked Nov 03 '22 16:11



1 Answer

As far as I know, the GPU's network-on-chip is a crossbar connecting the TPCs and the memory controllers. Therefore, in theory, one SM can interleave its memory accesses among the different memory controllers to achieve the full global memory bandwidth. But note that each crossbar interface has a buffer, and if this buffer is not large enough, the memory instructions in the active SM may be stalled. Moreover, each SM has a limited capacity for outstanding memory accesses. These issues may limit the memory bandwidth each SM can utilize. So I think answering your question requires some microbenchmarking, and my guess is that one SM cannot utilize the entire global memory bandwidth.
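To make the point about outstanding memory accesses concrete, here is a rough estimate based on Little's law (data in flight = bandwidth × latency). The 400 ns latency is a figure assumed purely for illustration, not a measured value for the GTX480; the 128-byte request corresponds to one warp reading 4 bytes per thread.

```latex
\begin{align*}
\text{bytes in flight} &= \text{bandwidth} \times \text{latency}
  = 177.4\ \text{GB/s} \times 400\ \text{ns} = 70{,}960\ \text{B} \\
\text{128-byte requests in flight} &= 70{,}960 / 128 \approx 554 \\
\text{one request at a time} &: \quad 128\ \text{B} / 400\ \text{ns} = 0.32\ \text{GB/s}
\end{align*}
```

Under that assumed latency, a single warp that waits for each load before issuing the next would see roughly 0.32 GB/s, and sustaining the peak would take several hundred requests in flight at once. Whether one SM can hold that many is exactly what the microbenchmark would have to show.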

lashgar answered Nov 15 '22 11:11
