
CUDA: Why is there a benefit to having more than 8 threads per block?

Tags:

cuda

I am a mathematician using CUDA for some numerical integration. My understanding is that each Nvidia streaming multiprocessor has 8 CUDA cores. So to me it seems that there is no benefit to using more than 8 threads per block. However, when I run my code I get a huge performance gain by using 32 threads per block as opposed to 8 threads per block.

Also, I noticed there is a huge gain from using more than 12 blocks (even though my card only has 12 streaming multiprocessors).

Is there a reason for this?
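
For concreteness, here is a minimal sketch of the kind of comparison described in the question. The kernel, problem size, and timing scaffolding are hypothetical stand-ins, not the asker's actual code:

    // Time the same memory-bound kernel with 8 vs. 32 threads per block.
    // Everything here is illustrative: kernel, size, and timing harness.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;      // one global load + one global store
    }

    static float timeLaunch(int threadsPerBlock, float *d_x, int n) {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        scale<<<blocks, threadsPerBlock>>>(d_x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start); cudaEventDestroy(stop);
        return ms;
    }

    int main() {
        const int n = 1 << 18;        // small enough for old grid-size limits
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));
        timeLaunch(32, d_x, n);       // warm-up launch, not timed
        printf("8 threads/block:  %.3f ms\n", timeLaunch(8, d_x, n));
        printf("32 threads/block: %.3f ms\n", timeLaunch(32, d_x, n));
        cudaFree(d_x);
        return 0;
    }

Note that the total number of threads is the same in both runs; only the block shape changes.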

Mykie asked Dec 12 '22


2 Answers

talonmies and chaohuang provide good information in the comments, and you should look into that (not sure why these aren't answers, but that's their call). In any event, I will provide an abbreviated partial answer to explain something that you might not be considering.

Let's say that you have 8 threads of control, and 8 processors. If all the instructions in all 8 threads are on-chip instructions taking only a single cycle, then all 8 threads will finish in n cycles (assuming n total instructions per thread).

Now let's say that each thread of control consists of n instructions, where a fraction r of these are off-chip memory instructions, which take, e.g., 100 cycles to complete. These 8 threads will now take [(1 - r) + 100r]n cycles to complete. If r=0.1, this is about 11 times more than the previous case.

Now let's say that we have 16 threads. When the first batch of 8 threads is blocked on the slow operations, the other threads can execute; on-chip instructions can execute, and off-chip instructions can start. So instead of needing 2[(1 - r) + 100r]n cycles to complete all threads, you might need only a little more than [(1 - r) + 100r]n. In essence, because you have some room to overlap waiting threads with other threads, you can add more threads for free.
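
To make the arithmetic concrete, here is a back-of-the-envelope calculation of those cycle counts. It is plain host-side code (no GPU required), and the instruction count, r, and 100-cycle latency are the illustrative numbers from above, not measurements:

    // Toy model of the formula above; all numbers are illustrative.
    #include <cstdio>

    // Cycles for one batch of 8 threads on 8 processors: [(1-r) + 100r] * n
    static double batchCycles(double n, double r) {
        return ((1.0 - r) + 100.0 * r) * n;
    }

    int main() {
        double n = 1000.0, r = 0.1;
        double one = batchCycles(n, r);
        printf("8 threads:                %.0f cycles\n", one);
        printf("16 threads, no overlap:   %.0f cycles\n", 2.0 * one);
        // With overlap, the second batch's on-chip work and memory issues
        // hide under the first batch's stalls, so the total is only a
        // little more than one batch (exactly how much depends on the
        // schedule).
        printf("16 threads, full overlap: ~%.0f cycles\n", one);
        return 0;
    }

With r = 0.1 and n = 1000 this prints 10900 cycles for one batch, so the second batch of threads is nearly free when it can overlap the first batch's stalls.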

This is the great strength of the GPU model: massive parallelism to overcome long latency. It takes a long time to do a little bit of work, but not much more time to do a lot more work. Note that occupancy - related to the amount of work (in threads) you have ready to hide latency - isn't all that important for peak performance when the arithmetic intensity (related to r in the above formulae) is high. You might play around with the CUDA Occupancy Calculator to see the effect I describe for different scenarios.
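
If you would rather query occupancy programmatically than use the spreadsheet, newer CUDA toolkits expose a runtime occupancy API. A minimal sketch (the scale kernel is a hypothetical stand-in):

    // Query how many blocks of a given size fit per SM, and the resulting
    // occupancy, for a stand-in kernel.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        for (int threads = 32; threads <= 1024; threads *= 2) {
            int blocksPerSM = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &blocksPerSM, scale, threads, 0 /* dynamic shared mem */);
            double occ = (double)(blocksPerSM * threads) /
                         prop.maxThreadsPerMultiProcessor;
            printf("%4d threads/block -> %d blocks/SM, occupancy %.2f\n",
                   threads, blocksPerSM, occ);
        }
        return 0;
    }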

Patrick87 answered Dec 14 '22


The short answer is latency hiding.

If you only have as many units of work (threads & blocks) as you have cores to work on them, and execution hits a memory operation that needs hundreds of clock cycles to complete, the GPU has nothing else to work on, so the cores sit idle until the memory op completes. Those are wasted compute cycles.

If you offer more units of work than you have cores to do the work, then when one of the units of work hits a long-latency memory operation, the hardware scheduler can swap some other unit of work into the core(s) so that the cores are kept busy doing productive work while the long-latency memory operation completes. Having an excess of threads or blocks provides a better opportunity to use all the compute cycles when there are long-latency memory ops in the mix.
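
As a sketch of what "offering more units of work than cores" looks like in practice, the launch below deliberately creates far more blocks than the device has SMs, so the scheduler always has another warp ready when one stalls on memory. The kernel and sizes are illustrative:

    // Oversubscribe the SMs with a memory-bound kernel; illustrative only.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];    // memory-bound: long-latency loads
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        const int n = 1 << 22, threads = 256;
        int blocks = (n + threads - 1) / threads;   // many more than SM count
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        printf("SMs: %d, blocks launched: %d\n",
               prop.multiProcessorCount, blocks);
        copy<<<blocks, threads>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }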

dthorpe answered Dec 15 '22