
Improving kernel performance by increasing occupancy?

Tags:

cuda

Here is the output of the Compute Visual Profiler for my kernel on a GT 440:

  • Kernel details: Grid size: [100 1 1], Block size: [256 1 1]
  • Register Ratio: 0.84375 ( 27648 / 32768 ) [35 registers per thread]
  • Shared Memory Ratio: 0.336914 ( 16560 / 49152 ) [5520 bytes per Block]
  • Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
  • Active threads per SM: 768 (Maximum Active threads per SM: 1536)
  • Potential Occupancy: 0.5 ( 24 / 48 )
  • Occupancy limiting factor: Registers

Note the occupancy figures in particular. Kernel execution time is 121195 us.

I reduced the number of registers per thread by moving some local variables to shared memory. The Compute Visual Profiler output then became:

  • Kernel details: Grid size: [100 1 1], Block size: [256 1 1]
  • Register Ratio: 1 ( 32768 / 32768 ) [30 registers per thread]
  • Shared Memory Ratio: 0.451823 ( 22208 / 49152 ) [5552 bytes per Block]
  • Active Blocks per SM: 4 (Maximum Active Blocks per SM: 8)
  • Active threads per SM: 1024 (Maximum Active threads per SM: 1536)
  • Potential Occupancy: 0.666667 ( 32 / 48 )
  • Occupancy limiting factor: Registers

Hence, 4 blocks now execute simultaneously on a single SM versus 3 blocks in the previous version. However, the execution time is 115756 us, which is almost the same! Why? Aren't the blocks completely independent, since they execute on different CUDA cores?
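
For reference, both active-block counts match a simple back-of-the-envelope calculation against the Fermi per-SM limits reported above (a rough sketch that ignores the hardware's per-warp register allocation granularity):

```cuda
#include <stdio.h>

int main(void)
{
    /* Fermi (GT 440) per-SM limits, as reported by the profiler above. */
    const int regs_per_sm        = 32768;
    const int max_threads_sm     = 1536;
    const int block_size         = 256;
    const int regs_per_thread[2] = { 35, 30 };   /* before / after the change */

    for (int i = 0; i < 2; ++i) {
        int regs_per_block = regs_per_thread[i] * block_size;  /* 8960 / 7680 */
        int active_blocks  = regs_per_sm / regs_per_block;     /* 3    / 4    */
        int active_threads = active_blocks * block_size;       /* 768  / 1024 */
        printf("%2d regs/thread -> %d blocks/SM, occupancy %.3f\n",
               regs_per_thread[i], active_blocks,
               (double)active_threads / max_threads_sm);
    }
    return 0;
}
```

This prints 3 blocks/SM and 0.500 occupancy for 35 registers per thread, and 4 blocks/SM and 0.667 for 30, matching the profiler.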

Asked Oct 12 '11 by AdelNick


1 Answer

You are implicitly assuming that higher occupancy automatically translates into higher performance. That is most often not the case.

The NVIDIA architecture needs a certain number of active warps per MP in order to hide the GPU's instruction pipeline latency. On your Fermi-based card, that requirement translates to a minimum occupancy of about 30%. Aiming for higher occupancy than that minimum will not necessarily result in higher throughput, because the latency bottleneck may have moved to another part of the GPU.

Your entry-level GPU doesn't have a lot of memory bandwidth, and it is quite possible that 3 blocks per MP is already enough to make your code memory-bandwidth limited, in which case increasing the number of blocks won't have any effect on performance (it might even go down because of increased memory controller contention and cache misses).

Further, you said you spilled variables to shared memory in order to reduce the register footprint of the kernel. On Fermi, shared memory has only about 1000 GB/s of bandwidth, compared to about 8000 GB/s for registers (see the link below for microbenchmarking results which demonstrate this). So you have moved variables to slower memory, which may also have a negative effect on performance, offsetting any benefit that the higher occupancy affords.
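
To make that last point concrete, here is a hypothetical sketch (not the asker's actual kernel) of the kind of change described in the question: parking a per-thread accumulator in shared memory to free a register. It assumes a 256-thread block like the one in the question, and an `out` array with one element per thread.

```cuda
// Hypothetical illustration of trading a register for shared memory.
// Each thread sums a strided slice of `in`; the two kernels differ only
// in where the per-thread accumulator lives.
__global__ void sum_in_register(const float *in, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc = 0.0f;                        // accumulator lives in a register
    for (int i = tid; i < n; i += stride)
        acc += in[i];
    out[tid] = acc;
}

__global__ void sum_in_shared(const float *in, float *out, int n)
{
    __shared__ float acc[256];               // one slot per thread: 1 KB more
                                             // shared memory per block, one
                                             // register fewer per thread
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    acc[threadIdx.x] = 0.0f;
    for (int i = tid; i < n; i += stride)
        acc[threadIdx.x] += in[i];           // every update now goes through
                                             // the slower shared-memory path
    out[tid] = acc[threadIdx.x];
}
```

The second version may well report better occupancy, yet every loop iteration now pays shared-memory bandwidth and latency instead of register bandwidth.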

If you have not already seen it, I highly recommend Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy" (pdf). It shows how exploiting instruction-level parallelism can increase GPU throughput to very high levels at very, very low occupancy.
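
As a rough sketch of that idea (hypothetical code, not taken from the talk), giving each thread several independent accumulators creates independent instruction chains that the scheduler can overlap even when few warps are resident:

```cuda
// Each thread carries four independent partial sums, so four loads/adds can
// be in flight per thread; fewer resident warps are needed to hide latency.
__global__ void sum_ilp4(const float *in, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;  // independent chains
    int i = tid;
    for (; i + 3 * stride < n; i += 4 * stride) {
        a0 += in[i];
        a1 += in[i + stride];
        a2 += in[i + 2 * stride];
        a3 += in[i + 3 * stride];
    }
    for (; i < n; i += stride)                          // leftover elements
        a0 += in[i];

    out[tid] = a0 + a1 + a2 + a3;
}
```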

Answered by talonmies