In what types of loops is it best to use the #pragma unroll directive in CUDA?

In CUDA it is possible to unroll loops using the #pragma unroll directive to improve performance by increasing instruction level parallelism. The #pragma can optionally be followed by a number that specifies how many times the loop must be unrolled.
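To make the syntax concrete, here is a minimal sketch of the three common forms of the directive. The kernel name, the `TAPS` constant, and the arithmetic are made up for illustration. The comments describe what each form asks for as I understand the CUDA documentation; the compiler is still free to make its own decision.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TAPS = 8;  // illustrative compile-time trip count

// Hypothetical kernel that exists only to show the three forms of the pragma.
__global__ void unroll_syntax_demo(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;

    // No argument: request complete unrolling. This is only possible
    // because the trip count (TAPS) is known at compile time.
    #pragma unroll
    for (int k = 0; k < TAPS; ++k)
        acc += in[(i + k) % n];

    // Explicit factor: request unrolling by 4. The trip count is a
    // run-time value, so the compiler has to handle leftover iterations.
    #pragma unroll 4
    for (int k = 0; k < n; ++k)
        acc += in[k];

    // Factor 1: ask the compiler not to unroll this loop at all.
    #pragma unroll 1
    for (int k = 0; k < n; ++k)
        acc -= in[k];

    out[i] = acc;
}

int main()
{
    const int n = 16;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int k = 0; k < n; ++k) in[k] = 1.0f;

    unroll_syntax_demo<<<1, n>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```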

Unfortunately, the docs do not give specific directions on when this directive should be used. Since small loops with a known trip count are already unrolled by the compiler, should #pragma unroll be used on larger loops? On small loops with a variable trip count? And what about the optional unroll factor? Also, is there recommended documentation about CUDA-specific loop unrolling?

875

asked Nov 04 '12 19:11

charis

1 Answer

There aren't any hard and fast rules. The CUDA compiler has at least two unrollers: one inside the NVVM or Open64 frontend, and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute. The reasons for turning off loop unrolling are twofold:

(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling also tends to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.

(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle loop counts that are not exactly a multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains to be had from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always provide the best decision.
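The two points above can be pictured with a small sketch. Both kernels and their names are hypothetical. The first shows a small per-thread array whose indexes become constants once the loops are fully unrolled, which is what lets the compiler keep it in registers. The second is a hand-written approximation of what unrolling by 4 looks like, including the clean-up loop; the code the compiler actually generates will differ in detail.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Point (1): after complete unrolling, every index into v[] is a
// compile-time constant, so v[] can live in registers instead of
// local memory. Each extra element then costs a register.
__global__ void small_array_sum(const float* in, float* out)
{
    float v[4];

    #pragma unroll
    for (int k = 0; k < 4; ++k)
        v[k] = in[threadIdx.x * 4 + k];

    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        acc += v[k] * v[k];

    out[threadIdx.x] = acc;
}

// Point (2): a manual picture of partial unrolling by 4 for a
// run-time trip count n. Intended for a single thread.
__global__ void sum_unrolled_by_hand(const float* in, float* out, int n)
{
    float acc = 0.0f;
    int k = 0;

    // Main body: four iterations per trip through the loop.
    for (; k + 3 < n; k += 4) {
        acc += in[k];
        acc += in[k + 1];
        acc += in[k + 2];
        acc += in[k + 3];
    }

    // Clean-up loop: the remaining n % 4 iterations. For short trip
    // counts, this part and the extra branching can dominate.
    for (; k < n; ++k)
        acc += in[k];

    *out = acc;
}

int main()
{
    const int n = 10;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, 3 * sizeof(float));
    for (int k = 0; k < n; ++k) in[k] = static_cast<float>(k + 1);

    small_array_sum<<<1, 2>>>(in, out);             // reads in[0..7], writes out[0..1]
    sum_unrolled_by_hand<<<1, 1>>>(in, out + 2, n); // writes out[2]
    cudaDeviceSynchronize();

    printf("%f %f %f\n", out[0], out[1], out[2]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` prints per-kernel register and spill counts, which is one way to check whether a given unrolling choice has increased register pressure.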

In rare cases I have found that manually providing a higher unrolling factor than the one the compiler chose automatically has a small beneficial effect on performance (with a typical gain in the single-digit percent range). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight computationally bound loops that benefit from minimization of the loop overhead.
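One way to run this kind of experiment is sketched below, assuming a simple memory-bound grid-stride kernel. The kernel, the `FACTOR` template parameter, the launch configuration, and the timing helper are all illustrative choices. The timings depend heavily on the GPU and compiler version, so treat this as a way to try factors rather than as evidence for any particular one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// FACTOR is a hypothetical tuning knob. The pragma accepts an integral
// constant expression in parentheses, so a template parameter can be used.
template <int FACTOR>
__global__ void scale_add(const float* __restrict__ x,
                          float* __restrict__ y, float a, int n)
{
    int stride = gridDim.x * blockDim.x;

    #pragma unroll (FACTOR)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Time one launch with CUDA events and return the elapsed milliseconds.
template <int FACTOR>
float time_kernel(const float* x, float* y, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale_add<FACTOR><<<64, 256>>>(x, y, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    time_kernel<1>(x, y, n);  // warm-up launch, result discarded

    printf("unroll 1: %.3f ms\n", time_kernel<1>(x, y, n));
    printf("unroll 4: %.3f ms\n", time_kernel<4>(x, y, n));
    printf("unroll 8: %.3f ms\n", time_kernel<8>(x, y, n));

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

A single timed launch is noisy; averaging several runs, or using a profiler, gives a more trustworthy comparison.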

Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.

190

answered May 20 '23 13:05

njuffa

Tags: optimization, cuda, loop-unrolling
