I am trying to understand the basic architecture of a GPU. I have gone through a lot of material, including this very good SO answer, but I am still confused and cannot get a clear picture of it.
My Understanding:
- A GPU contains two or more Streaming Multiprocessors (SMs), depending on the compute capability value.
- Each SM consists of Streaming Processors (SPs), which are actually responsible for executing instructions.
- Each block is processed by the SPs in the form of warps (32 threads).
- Each block has access to a shared memory. A different block cannot access the data of some other block's shared memory.
Confusion:
In the following image, I am not able to tell which one is the Streaming Multiprocessor (SM) and which one is the SP. I think that Multiprocessor-1 represents a single SM and Processor-1 (up to M) represents a single SP. But I am not sure about this, because each Processor (in blue) is shown with its own registers, and as far as I know, registers are provided per thread.
It would be very helpful if you could provide a basic overview with respect to this image or any other image.
First, some comments on the "My understanding" portion of the question:
- The number of SMs depends on the GPU model: there are low-end models with just one SM, and high-end ones with as many as 30! Compute capability defines what those SMs are capable of, but not how many SMs there are in a GPU.
- Each thread block is assigned to an SM, not to an SP. There can be multiple thread blocks running on a given SM, subject to its resource limitations.
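To make the block-to-warp relationship from the question concrete, here is a minimal host-side C++ sketch of how a one-dimensional block breaks down into warps. The helper names (`warps_per_block`, `warp_of_thread`, `lane_of_thread`) are made up for illustration and are not part of any CUDA API; the only hardware fact assumed is the warp size of 32 mentioned in the question.

```cpp
#include <cassert>

// Warp size assumed to be 32 threads, as stated in the question.
constexpr int kWarpSize = 32;

// Number of warps needed to hold a block of `threads_per_block` threads.
// A partially filled last warp still counts as a whole warp.
inline int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Index of the warp (within its block) that holds linear thread index `tid`.
inline int warp_of_thread(int tid) {
    assert(tid >= 0);
    return tid / kWarpSize;
}

// Lane, i.e. the position of thread `tid` inside its warp (0..31).
inline int lane_of_thread(int tid) {
    assert(tid >= 0);
    return tid % kWarpSize;
}
```

For example, a block of 256 threads is split into 8 warps, and a block of 100 threads still needs 4 warps, with the last one only partly filled. All warps of a block live on the one SM that the block was assigned to.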
On to the diagram:
- Orange boxes are indeed SMs, just as they are labeled. Each SM has a shared memory pool, divided between all thread blocks running on this SM.
- Blue boxes are SPs. Since an SP is a scalar lane, it runs one thread, and each thread is provided with its own set of registers, again, just like the diagram shows.
Addressing the follow-up question:
- Each SM can have multiple resident thread blocks. The maximum number of thread blocks resident on an SM is determined by compute capability. The achieved number can be lower than the maximum when it is limited by the number of registers or the amount of shared memory consumed by each thread block.
- The SM then schedules instructions from all warps resident on it, picking among warps that have instructions ready for execution, and those warps may come from any thread block resident on this SM. You generally want to have many warps resident, so that at any given moment the SPs can be kept busy running instructions from whatever warps are ready.
- Number of cores per SM is not a very useful metric, and you need not think too much about it at this point.
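The resident-block limit described above can be sketched as simple arithmetic. This is a rough host-side C++ model, not the real occupancy calculation: the struct and function names are hypothetical, the limits are placeholders you would fill in for your own GPU, and it ignores other constraints such as the maximum number of resident threads or warps per SM and hardware allocation granularity.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits; real values depend on compute capability.
struct SmLimits {
    int max_blocks;        // hardware cap on resident blocks per SM
    int registers;         // registers available on the SM
    int shared_mem_bytes;  // shared memory available on the SM
};

// Hypothetical resource usage of one thread block of a kernel.
struct BlockUsage {
    int threads;               // threads per block
    int registers_per_thread;  // registers used by each thread
    int shared_mem_bytes;      // shared memory used by the block
};

// Resident blocks per SM: the smallest of the hardware cap, the number of
// blocks that fit in the register file, and the number that fit in shared memory.
inline int resident_blocks(const SmLimits& sm, const BlockUsage& blk) {
    assert(blk.threads > 0 && blk.registers_per_thread > 0);
    int by_regs = sm.registers / (blk.threads * blk.registers_per_thread);
    // A block that uses no shared memory is not limited by it.
    int by_smem = blk.shared_mem_bytes > 0
                      ? sm.shared_mem_bytes / blk.shared_mem_bytes
                      : sm.max_blocks;
    return std::min({sm.max_blocks, by_regs, by_smem});
}
```

With illustrative limits of 8 blocks, 16384 registers and 16384 bytes of shared memory per SM, a block of 256 threads using 16 registers per thread and 4096 bytes of shared memory gives 4 resident blocks, because both registers and shared memory run out before the hardware cap of 8 is reached. These numbers are made up for the example and do not describe any particular GPU.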