According to the documentation, in devices of compute capability 1.x the compiler will inline __device__
functions by default, but for devices of compute capability 2.x and higher it will only do so if deemed appropriate by the compiler. When is it appropriate not to? There are also qualifiers such as __noinline__
and __forceinline__
. In which cases is it better not to inline a __device__
function?
An inline function is one for which the compiler copies the code from the function definition directly into the code of the calling function rather than creating a separate set of instructions in memory. This eliminates call-linkage overhead and can expose significant optimization opportunities.
Inline functions behave like macros. When an inline function gets called, instead of transferring the control to the function, the call gets substituted with the function code. Thus this saves time and improves performance.
__global__ - Runs on the GPU, called from the CPU or the GPU*.
The compiler heuristic for inlining presumably evaluates the potential performance benefit from inlining due to the elimination of function call overhead against other characteristics including compile time. Aggressive inlining can lead to very large code that cause very long compile times. From observing the code generated for many different kernels, the CUDA compiler seems to inline in the vast majority of cases. Note that in some cases, inlining is currently not possible, for example when the called function is in a different, separately compiled, compilation unit.
In my experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. I have used __noinline__
to limit code size and thus reduce excessive compile times. Use of __noinline__
has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. I have never found a case where use of __noinline__
improved performance, but of course such cases could exist, possibly due to instruction cache effects.
I've experienced it that if you force __device__
function call to be compiled inline, it can decreases runtime to half. Just in a recent one, I made a function call (which passed just 5 variables to function) inline and kernel execution time decreased from 9.5ms to 4.5ms (almost half). And if you consider that you want to execute the same kernel hundred millions of times with total runtime of a week or more (like my case and many others that work on CFD or MD projects), increase in compile time is nothing important comparing to huge saving in runtime.
All in all, I think it worth to try inline function call impact on runtime especially for codes with very long runtimes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With