
CUDA - Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship?


I know that there are multiprocessors on a CUDA GPU which contain CUDA cores in them. In my workplace I am working with a GTX 590, which contains 512 CUDA cores, 16 multiprocessors and has a warp size of 32. This means there are 32 CUDA cores in each multiprocessor, which work on exactly the same code in the same warp. Finally, the maximum number of threads per block is 1024.
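For reference, these numbers can be checked at runtime with cudaGetDeviceProperties; this is just a minimal sketch, and device 0 is assumed to be one of the two chips on the card:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0: one of the two chips on a GTX 590

        printf("multiProcessorCount: %d\n", prop.multiProcessorCount); // 16 SMs per chip
        printf("warpSize:            %d\n", prop.warpSize);            // 32
        printf("maxThreadsPerBlock:  %d\n", prop.maxThreadsPerBlock);  // 1024
        return 0;
    }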

My question is how exactly the block size, the multiprocessor count and the warp size are related. Let me describe my understanding of the situation: for example, I allocate N blocks with the maximum threadsPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case 16 of the N blocks are assigned to different multiprocessors. Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor. The threads in the same multiprocessor (warp) process the same line of code and use the shared memory of the current multiprocessor. If the current 32 threads encounter an off-chip operation like memory reads/writes, they are replaced with another group of 32 threads from the current block. So, there are actually only 32 threads in a single block running in parallel on a multiprocessor at any given time, not the whole 1024. Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor. And finally, there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor, it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)
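For concreteness, a minimal sketch of the launch configuration I am describing (the kernel body and the value of N are placeholders, not my actual code):

    #include <cuda_runtime.h>

    __global__ void scale(float *data)                    // placeholder kernel
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        data[tid] *= 2.0f;
    }

    int main() {
        const int N = 64;                                 // hypothetical block count
        const int threadsPerBlock = 1024;                 // 32 warps of 32 threads per block
        float *d_data;
        cudaMalloc(&d_data, N * threadsPerBlock * sizeof(float));
        scale<<<N, threadsPerBlock>>>(d_data);            // N blocks spread over the 16 SMs
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }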

So, is my model of CUDA parallel execution correct? If not, what is wrong or missing? I want to fine-tune the current project I am working on, so I need the most accurate working model of the whole thing.

asked Jul 19 '12 by Ufuk Can Bicici


1 Answer

In my workplace I am working with a GTX 590, which contains 512 CUDA cores, 16 multiprocessors and has a warp size of 32. This means there are 32 CUDA cores in each multiprocessor, which work on exactly the same code in the same warp. Finally, the maximum number of threads per block is 1024.

A GTX 590 contains twice the numbers you mentioned, since there are two GPUs on the card. Below, I focus on a single chip.

Let me describe my understanding of the situation: for example, I allocate N blocks with the maximum threadsPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case 16 of the N blocks are assigned to different multiprocessors.

Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while a few of them go idle. I don't know why.

Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor.

The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM; these handle things such as single-precision floating point and most 32-bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction that is currently being processed is a load/store, it must be scheduled twice. And there are only 4 special function units, which handle things such as trigonometry, so those instructions must be scheduled 32 / 4 = 8 times.

The threads in the same multiprocessor (warp) process the same line of code and use the shared memory of the current multiprocessor.

No, there can be many more than 32 threads "in flight" at the same time in a single SM.

If the current 32 threads encounter an off-chip operation like memory reads/writes, they are replaced with another group of 32 threads from the current block. So, there are actually only 32 threads in a single block running in parallel on a multiprocessor at any given time, not the whole 1024.

No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps will be swapped in as data dependencies occur for values that are still in the pipeline. So, if the code contains two instructions where the second one uses the output from the first, the warp will be put on hold while the value from the first instruction makes its way through the pipeline.
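To illustrate (my example, not part of the question): in the hypothetical kernel below the second statement depends on the first, so while a warp waits for the first result to clear the ALU pipeline, the scheduler issues instructions from other resident warps.

    __global__ void dependent_chain(const float *x, float *y)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a = x[tid] * 2.0f;   // instruction 1
        float b = a * a;           // instruction 2: needs 'a', which is still in the pipeline
        y[tid] = b;                // this warp is held; other warps are issued in the meantime
    }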

Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor.

A multiprocessor can process more than one block at a time but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.
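In later CUDA releases the same figure can also be queried at runtime through the occupancy API; a minimal sketch, assuming a placeholder kernel named myKernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { data[threadIdx.x] = 0.0f; }   // placeholder

    int main() {
        int blocksPerSM = 0;
        // Maximum number of 1024-thread blocks that can be resident on one SM for this
        // kernel, given its register and shared memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 1024, 0);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }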

And finally, there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor, it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)

No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block with your kernel, you will get a kernel launch error and your program won't run at all.
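A quick way to see that failure mode (my own sketch; here triggered by an over-sized block rather than by register pressure, but over-use of registers or shared memory is reported the same way at launch):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void bigKernel(float *data) { data[threadIdx.x] = 1.0f; }

    int main() {
        float *d_data;
        cudaMalloc(&d_data, 2048 * sizeof(float));

        bigKernel<<<1, 2048>>>(d_data);        // 2048 > maxThreadsPerBlock (1024): launch fails
        cudaError_t err = cudaGetLastError();  // the error is reported at launch time
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));

        cudaFree(d_data);
        return 0;
    }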

It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.

See @harrism's answer in this question: CUDA: How many concurrent threads in total?

answered Oct 24 '22 by Roger Dahl