Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

cuda inline and noinline device functions

According to the documentation, in devices of compute capability 1.x the compiler will inline __device__ functions by default, but for devices of compute capability 2.x and higher it will only do so if deemed appropriate by the compiler. When is it appropriate not to? There are also qualifiers such as __noinline__ and __forceinline__. In which cases is it better not to inline a __device__ function?

like image 621
Michael Avatar asked Jun 13 '14 14:06

Michael


People also ask

What is the function of inline?

An inline function is one for which the compiler copies the code from the function definition directly into the code of the calling function rather than creating a separate set of instructions in memory. This eliminates call-linkage overhead and can expose significant optimization opportunities.

Does inline improve performance?

Inline functions behave like macros. When an inline function gets called, instead of transferring the control to the function, the call gets substituted with the function code. Thus this saves time and improves performance.

Which of the following function that is executed on device and callable from device only in GPU?

__global__ - Runs on the GPU, called from the CPU or the GPU*.


2 Answers

The compiler heuristic for inlining presumably evaluates the potential performance benefit from inlining due to the elimination of function call overhead against other characteristics including compile time. Aggressive inlining can lead to very large code that cause very long compile times. From observing the code generated for many different kernels, the CUDA compiler seems to inline in the vast majority of cases. Note that in some cases, inlining is currently not possible, for example when the called function is in a different, separately compiled, compilation unit.

In my experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. I have used __noinline__ to limit code size and thus reduce excessive compile times. Use of __noinline__ has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. I have never found a case where use of __noinline__ improved performance, but of course such cases could exist, possibly due to instruction cache effects.

like image 109
njuffa Avatar answered Nov 27 '22 14:11

njuffa


I've experienced it that if you force __device__ function call to be compiled inline, it can decreases runtime to half. Just in a recent one, I made a function call (which passed just 5 variables to function) inline and kernel execution time decreased from 9.5ms to 4.5ms (almost half). And if you consider that you want to execute the same kernel hundred millions of times with total runtime of a week or more (like my case and many others that work on CFD or MD projects), increase in compile time is nothing important comparing to huge saving in runtime.

All in all, I think it worth to try inline function call impact on runtime especially for codes with very long runtimes.

like image 34
Mo Sani Avatar answered Nov 27 '22 15:11

Mo Sani