Here is the output of the Compute Visual Profiler for my kernel on a GT 440. Please pay attention to the bullets marked in bold. The kernel execution time is 121195 us.
I reduced the number of registers per thread by moving some local variables to shared memory. The Compute Visual Profiler output became:

Hence, 4 blocks are now executed simultaneously on a single SM, versus 3 blocks in the previous version. However, the execution time is 115756 us, which is almost the same! Why? Aren't the blocks completely independent, given that they execute on different CUDA cores?
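The change was roughly of this form (a minimal sketch with placeholder names and block size, not my actual kernel):

```cuda
// Sketch only: a per-thread accumulator that would normally live in a
// register is placed in shared memory instead, indexed by threadIdx.x,
// to lower the register count per thread and raise occupancy.
__global__ void kernel(const float *in, float *out, int n)
{
    __shared__ float acc[256];          // one slot per thread; blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    acc[threadIdx.x] = 0.0f;            // before: "float acc = 0.0f;" in a register
    if (tid < n) {
        for (int i = 0; i < 16; ++i)    // arbitrary stand-in for the real work
            acc[threadIdx.x] += in[tid] * i;
        out[tid] = acc[threadIdx.x];
    }
}
```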
You are implicitly assuming that higher occupancy automatically translates into higher performance. That is most often not the case.
The NVIDIA architecture needs a certain number of active warps per MP in order to hide the instruction pipeline latency of the GPU. On your Fermi based card, that requirement translates to a minimum occupancy of about 30%. Aiming for occupancies higher than that minimum will not necessarily result in higher throughput, as the latency bottleneck may have moved to another part of the GPU.

Your entry level GPU doesn't have a lot of memory bandwidth, and it is quite possible that 3 blocks per MP is enough to make your code memory bandwidth limited, in which case increasing the number of blocks won't have any effect on performance (it might even go down because of increased memory controller contention and cache misses).

Further, you said you spilled variables to shared memory in order to reduce the register footprint of the kernel. On Fermi, shared memory has only about 1000 GB/s of bandwidth, compared to about 8000 GB/s for registers (see the link below for the microbenchmarking results which demonstrate this). So you have moved variables to slower memory, which may also have a negative effect on performance, offsetting any benefit that higher occupancy affords.
If you have not already seen it, I highly recommend Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy" (pdf). It shows how exploiting instruction level parallelism can increase GPU throughput to very high levels at very, very low occupancy.