i am having some troubles understanding threads in NVIDIA gpu architecture with cuda.
please could anybody clarify these info: an 8800 gpu has 16 SMs with 8 SPs each. so we have 128 SPs.
i was viewing Stanford University's video presentation and it was saying that every SP is capable of running 96 threads concurrently. does this mean that it (SP) can run 96/32=3 warps concurrently?
moreover, since every SP can run 96 threads and we have 8 SPs in every SM. does this mean that every SM can run 96*8=768 threads concurrently?? but if every SM can run a single Block at a time, and the maximum number of threads in a block is 512, so what is the purpose of running 768 threads concurrently and have a max of 512 threads?
a more general question is:how are blocks,threads,and warps distributed to SMs and SPs? i read that every SM gets a single block to execute at a time and threads in a block is divided into warps (32 threads), and SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU it is executes as a grid of blocks of threads.
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU, however to get performance you need to understand the hardware too which is SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads (if you have 128 threads in a block (for example) then threads 0-31 will be in one warp, 32-63 in the next and so on. Warps are very important for a few reasons, the most important being:
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives the GPUs the scalability - if you have one SM then all blocks run on the same SM on one big queue, if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why a SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4 way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4 cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With