There is a lot of advice on when to use inline functions and when to avoid them in regular C coding. What is the effect of __forceinline__ on CUDA C __device__ functions? Where should they be used and where should they be avoided?
Normally the nvcc device code compiler will make its own decisions about when to inline a particular __device__ function, and generally speaking you probably don't need to worry about overriding that with the __forceinline__ decorator/directive.
Devices of compute capability (cc) 1.x don't have all the same hardware capabilities as newer devices, so very often the compiler will automatically inline functions for those devices.
I think the reason to specify __forceinline__ is the same as what you may have learned about host C code. It is usually used for optimization when the compiler might not otherwise inline the function (e.g. on cc 2.x or newer devices). The saving (the function call overhead) might be negligible if you were calling the function once, but if you were calling the function in a loop, for example, making sure it was inlined might give a noticeable improvement in code execution.
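To make the loop case concrete, here is a minimal sketch of a small helper marked __forceinline__ and called repeatedly inside a kernel. The names axpy_step and accumulate, the sizes, and the constants are made up for illustration, and whether forcing the inline actually helps would have to be measured on your own code and device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: __forceinline__ asks nvcc to inline it at each call site.
__device__ __forceinline__ float axpy_step(float a, float x, float y)
{
    return a * x + y;
}

// Hypothetical kernel: calls the helper in a loop, the case where
// per-call overhead would add up if the function were not inlined.
__global__ void accumulate(const float *x, float *out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < iters; ++k)
        acc = axpy_step(0.5f, x[i], acc);
    out[i] = acc;
}

int main()
{
    const int n = 256, iters = 8;
    float h_x[n], h_out[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x = nullptr, *d_out = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    accumulate<<<(n + 127) / 128, 128>>>(d_x, d_out, n, iters);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Each iteration adds 0.5 * 1.0, so 8 iterations should give 4.0.
    printf("out[0] = %f\n", h_out[0]);

    cudaFree(d_x);
    cudaFree(d_out);
    return 0;
}
```

One way to check what the compiler really did is to build once with and once without the qualifier and compare the generated code or the timings, rather than assuming a difference.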
As a counterexample, inlining and recursion generally don't mix. For a recursive function that calls itself, I don't think it's possible to handle arbitrary recursion and also strict inlining. So if you intend to use a function recursively (supported in cc 2.x and above) you probably wouldn't want to specify __forceinline__.
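As a rough illustration of the recursive case, the sketch below leaves the inlining decision to the compiler by not using __forceinline__ at all. The names factorial and fact_kernel are hypothetical, and the example assumes a device of cc 2.x or newer, where device-side recursion is supported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical recursive device function: no __forceinline__, since the
// recursion depth depends on a runtime argument.
__device__ unsigned int factorial(unsigned int n)
{
    return (n <= 1u) ? 1u : n * factorial(n - 1u);
}

// Hypothetical kernel: a single thread computes the value and stores it.
__global__ void fact_kernel(unsigned int n, unsigned int *result)
{
    *result = factorial(n);
}

int main()
{
    unsigned int h_result = 0;
    unsigned int *d_result = nullptr;
    cudaMalloc(&d_result, sizeof(unsigned int));

    fact_kernel<<<1, 1>>>(5u, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    printf("5! = %u\n", h_result);  // 5! is 120

    cudaFree(d_result);
    return 0;
}
```

When building recursive device code, nvcc may warn that the stack size cannot be determined statically. That is expected for recursion and is separate from the inlining question.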
In general, I think you should let the compiler manage this for you. It will intelligently decide whether to inline a function.