Effect of __forceinline__ on CUDA C __device__ functions

Question

There is a lot of advice on when to use inline functions and when to avoid them in regular C code. What is the effect of __forceinline__ on CUDA C __device__ functions? Where should it be used, and where should it be avoided?

Robert Crovella · Accepted Answer

Normally the nvcc device code compiler will make its own decisions about when to inline a particular __device__ function, and generally speaking, you probably don't need to worry about overriding that with the __forceinline__ decorator/directive.

Compute capability (cc) 1.x devices don't have all the same hardware capabilities as newer devices, so very often the compiler will automatically inline functions for those devices.

I think the reason to specify __forceinline__ is the same as what you may have learned about host C code. It is usually used as an optimization when the compiler might not otherwise inline the function (e.g. on cc 2.x or newer devices). The benefit of this optimization (i.e. removing function call overhead) might be negligible if you were calling the function once, but if you were calling the function in a loop, for example, making sure it was inlined might give a noticeable improvement in execution time.
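To make the loop case concrete, here is a minimal sketch of a small helper marked __forceinline__ and called repeatedly from a kernel. The names scale_add, apply_repeated and run_apply_repeated are hypothetical, chosen only for illustration, and the host helper omits error checking for brevity. Whether forcing the inline actually changes performance depends on the compiler version and target architecture, so treat this as something to measure rather than assume.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: tiny body, so call overhead could be a
// noticeable fraction of the work if it were not inlined.
__device__ __forceinline__ float scale_add(float x, float a, float b)
{
    return a * x + b;
}

// Hypothetical kernel: calls the helper in a loop, once per iteration.
__global__ void apply_repeated(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = data[i];
    for (int k = 0; k < iters; ++k)
        v = scale_add(v, 0.5f, 1.0f);   // candidate for inlining
    data[i] = v;
}

// Hypothetical host wrapper: runs the kernel on a single value and
// returns the result (error checking omitted for brevity).
float run_apply_repeated(float x0, int iters)
{
    float *d = nullptr;
    cudaMalloc((void **)&d, sizeof(float));
    cudaMemcpy(d, &x0, sizeof(float), cudaMemcpyHostToDevice);

    apply_repeated<<<1, 1>>>(d, 1, iters);

    float out = 0.0f;
    cudaMemcpy(&out, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

Removing __forceinline__ from scale_add leaves the result unchanged; the qualifier only affects whether the compiler is allowed to emit a real call.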

As a counterexample, inlining and recursion generally don't mix. For a recursive function that calls itself, I don't think it's possible to handle arbitrary recursion and also strict inlining. So if you intend to use a function recursively (supported on cc 2.x and above), you probably wouldn't want to specify __forceinline__.
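As a rough illustration of the recursive case, the sketch below declares a recursive __device__ function without __forceinline__, leaving the inlining decision to the compiler. The names factorial, factorial_kernel and run_factorial are hypothetical, and the example assumes a cc 2.x or newer device and a small recursion depth, so that the default per-thread stack is sufficient. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical recursive device function. No __forceinline__ here:
// the function calls itself, so the compiler needs a real call.
__device__ unsigned long long factorial(int n)
{
    return (n <= 1) ? 1ULL : (unsigned long long)n * factorial(n - 1);
}

// Hypothetical kernel: a single thread computes n! and stores it.
__global__ void factorial_kernel(int n, unsigned long long *out)
{
    *out = factorial(n);
}

// Hypothetical host wrapper (error checking omitted for brevity).
unsigned long long run_factorial(int n)
{
    unsigned long long *d = nullptr;
    cudaMalloc((void **)&d, sizeof(unsigned long long));

    factorial_kernel<<<1, 1>>>(n, d);

    unsigned long long out = 0;
    cudaMemcpy(&out, d, sizeof(out), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

CUDA also provides a __noinline__ qualifier, which asks the compiler not to inline a function; it could be added to factorial here to state that intent explicitly.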

In general, I think you should let the compiler manage this for you. It will intelligently decide whether to inline a function.


Tags: c, cuda, gpgpu, nvidia

Farzad
