CUDA inline and noinline device functions

According to the documentation, for devices of compute capability 1.x the compiler inlines __device__ functions by default, but for devices of compute capability 2.x and higher it does so only when it deems inlining appropriate. When is it appropriate not to inline? There are also the qualifiers __noinline__ and __forceinline__. In which cases is it better not to inline a __device__ function?

asked Jun 13 '14 14:06

Michael

2 Answers

The compiler heuristic for inlining presumably weighs the potential performance benefit of inlining, which comes from eliminating function call overhead, against other characteristics, including compile time. Aggressive inlining can lead to very large code that causes very long compile times. From observing the code generated for many different kernels, the CUDA compiler seems to inline in the vast majority of cases. Note that in some cases inlining is currently not possible, for example when the called function is in a different, separately compiled, compilation unit.

In my experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. I have used __noinline__ to limit code size and thus reduce excessive compile times. Use of __noinline__ has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. I have never found a case where use of __noinline__ improved performance, but of course such cases could exist, possibly due to instruction cache effects.
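As a minimal sketch of how the two qualifiers are applied, the program below defines the same arithmetic twice, once marked __noinline__ and once marked __forceinline__. The function and kernel names (axpy_call, axpy_inline, kernel_call, kernel_inline) are hypothetical and chosen only for illustration. Compiling with `nvcc -Xptxas -v` prints the register usage of each kernel, which is one way to check the effect on register pressure for a specific kernel instead of guessing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: ask the compiler to keep this as a real function call.
__device__ __noinline__ float axpy_call(float a, float x, float y)
{
    return a * x + y;
}

// Same arithmetic, but request inlining regardless of the compiler heuristic.
__device__ __forceinline__ float axpy_inline(float a, float x, float y)
{
    return a * x + y;
}

__global__ void kernel_call(const float* x, float* y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = axpy_call(a, x[i], y[i]);
}

__global__ void kernel_inline(const float* x, float* y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = axpy_inline(a, x[i], y[i]);
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Managed memory keeps the sketch short; explicit copies work equally well.
    float *x, *y1, *y2;
    cudaMallocManaged(&x, bytes);
    cudaMallocManaged(&y1, bytes);
    cudaMallocManaged(&y2, bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y1[i] = 2.0f; y2[i] = 2.0f; }

    const int block = 256;
    const int grid = (n + block - 1) / block;
    kernel_call<<<grid, block>>>(x, y1, 3.0f, n);
    kernel_inline<<<grid, block>>>(x, y2, 3.0f, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Both variants should produce the same values; only code generation differs.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (y1[i] != y2[i]) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(x); cudaFree(y1); cudaFree(y2);
    return 0;
}
```

The qualifiers change only how the code is generated, not what it computes, so any difference shows up in the compiler's resource report and in timings rather than in the results.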

answered Nov 27 '22 14:11

njuffa

In my experience, forcing a __device__ function call to be inlined can cut the runtime in half. In one recent case, I inlined a function call (one that passed just 5 variables to the function) and the kernel execution time dropped from 9.5 ms to 4.5 ms, roughly half. If you need to run the same kernel hundreds of millions of times, with a total runtime of a week or more (as in my case and in many other CFD or MD projects), the increase in compile time is insignificant compared to the large saving in runtime.

All in all, I think it is worth testing the impact of inlining function calls on runtime, especially for codes with very long runtimes.
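One way to run such a test is to time the same kernel body with a __noinline__ and a __forceinline__ version of the helper, using CUDA events. The sketch below is an illustration under assumptions, not the code from the case described above: the 5-argument helper, the kernel names, the loop count, and the problem size are all hypothetical, and the measured times depend on the GPU, the compiler version, and the kernel itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 5-argument helper, forced to stay a real function call.
__device__ __noinline__ float step_call(float a, float b, float c, float d, float e)
{
    return a * b + c * d + e;
}

// Same helper, with inlining requested explicitly.
__device__ __forceinline__ float step_inline(float a, float b, float c, float d, float e)
{
    return a * b + c * d + e;
}

__global__ void kernel_call(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    for (int k = 0; k < 256; ++k)          // repeat so the call cost is visible
        v = step_call(v, 0.5f, 0.25f, (float)k, 1.0f);
    out[i] = v;
}

__global__ void kernel_inline(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    for (int k = 0; k < 256; ++k)
        v = step_inline(v, 0.5f, 0.25f, (float)k, 1.0f);
    out[i] = v;
}

using Kernel = void (*)(const float*, float*, int);

// Returns the average time per launch in milliseconds.
float time_ms(Kernel kernel, const float* in, float* out, int n, int reps)
{
    const int block = 256;
    const int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<grid, block>>>(in, out, n);   // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        kernel<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    float t_call = time_ms(kernel_call, in, out, n, 20);
    float t_inline = time_ms(kernel_inline, in, out, n, 20);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("__noinline__:    %.3f ms per launch\n", t_call);
    std::printf("__forceinline__: %.3f ms per launch\n", t_inline);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A small helper like this is usually inlined by the compiler anyway, so the informative comparison in a real project is between the unqualified function and each of the two qualified variants on the actual kernel.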

answered Nov 27 '22 15:11

Mo Sani

Tags: compiler-optimization, inline, cuda, device
