Link-time optimization and inline

In my experience, there's a lot of code that explicitly uses inline functions, which involves a tradeoff:

  1. The code becomes less succinct and somewhat less maintainable.
  2. Sometimes, inlining can greatly increase run-time performance.
  3. Inlining is decided at a fixed point in time, maybe without a terribly good foreknowledge of its uses, or without considering all (future) surrounding circumstances.

The question is: does link-time optimization (e.g., in GCC) render manual inlining, e.g., declaring a function "inline" in C99 and providing an implementation, obsolete? Is it true that we don't need to consider inlining for most functions ourselves? And what about functions that always benefit from inlining, e.g., deg_to_rad(x)?

Clarification: I am not thinking about functions that are in the same translation-unit anyway, but about functions that should logically reside in different translation-units.
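
To make the cross-translation-unit case concrete, this is the kind of C99 pattern I have in mind (a minimal sketch; the file names and deg_to_rad are only placeholders): the body lives in a header so every translation unit can see it, and one .c file supplies the external definition.

deg.h

#ifndef DEG_H
#define DEG_H

/* inline definition: visible to every translation unit that includes this header */
inline double deg_to_rad(double deg) {
    return deg * (3.14159265358979323846 / 180.0);
}

#endif

deg.c

#include "deg.h"

/* C99: exactly one translation unit must provide the external definition */
extern inline double deg_to_rad(double deg);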

Update: I have often seen opposition to "inline", and it has been suggested to be obsolete. Personally, however, I do see explicitly inlined functions often: as functions defined in a class body.

asked Aug 12 '11 by ccom


2 Answers

Even with LTO, a compiler still has to use heuristics to determine whether or not to inline a function for every call (note that it makes the decision not per function, but per call). The heuristics take into account factors such as: is the call in a loop, is the loop unrolled, how big is the function, how frequently is it called globally, and so on. The compiler will certainly never be able to accurately determine at compile time how frequently code is called, or whether the code expansion is likely to blow out the instruction/trace/loop/microcode caches of a particular CPU.
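
These heuristics can be nudged globally - though still not decided per call - through GCC's inlining parameters; for example (the numbers are purely illustrative):

gcc -O2 --param max-inline-insns-auto=60 --param inline-unit-growth=40 -c foo.c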

Profile Guided Optimization is supposed to be a step towards addressing this, but if you've ever tried it, you are likely to have noticed that you can get a swing in performance on the order of 0-2%, and it can be in either direction! :-) It's still a work in progress.
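
For reference, the basic GCC PGO workflow looks roughly like this (a sketch - app.c and the workload are placeholders, and the quality of the profile depends entirely on how representative that run is):

gcc -O3 -fprofile-generate -o app app.c    # instrumented build
./app < representative_input               # run it to produce .gcda profile data
gcc -O3 -fprofile-use -o app app.c         # rebuild, letting GCC use the profile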

If performance is your ultimate goal, and you really know what you are doing and have really done a thorough analysis of your code, what you need is a way to tell the compiler to inline or not inline on a per-call basis, not a per-function hint. In practice I have managed this by using compiler-specific "force_no_inline" type hints for cases where I don't want inlining, and a separate "force_inline" copy of the function (or a macro in the rare case that fails) for when I want it inlined. If anyone knows how to do this in a cleaner way with compiler-specific hints (for any C/C++ compilers), please let me know.
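
For example, with GCC (and Clang) those two copies can be written with function attributes, something along these lines (the clamp_* names are made up for illustration):

/* copy the compiler is forced to inline at every call site */
static inline __attribute__((always_inline)) int clamp_inl(int x) {
    return x < 0 ? 0 : x;
}

/* copy the compiler is forbidden to inline */
static __attribute__((noinline)) int clamp_call(int x) {
    return x < 0 ? 0 : x;
}

At each call site you then pick clamp_inl() or clamp_call() depending on whether you want the expansion there.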

To specifically address your points:

1.The code becomes less succinct and somewhat less maintainable.

Generally, no - it's just a keyword hint about whether the function should be inlined. However, if you jump through hoops like I described in the last paragraph, then yes.

2.Sometimes, inlining can greatly increase run-time performance.

When leaving the compiler to its own devices - yes, it certainly can, but mostly doesn't. The compiler has good heuristics that make good, although not always optimal, inlining decisions. Specifically for the keyword, compilers may ignore it entirely, or use it as a weak hint - in general they do seem averse to inlining code that red-flags their heuristics (like inlining a 16k function into a loop unrolled 16x).

3.Inlining is decided at a fixed point in time, maybe without a terribly good foreknowledge of its uses, or without considering all (future) surrounding circumstances.

Yes, it uses static analysis. Dynamic knowledge can come from your own insight, by manually controlling inlining on a per-call basis, or theoretically from PGO (which still sucks).

answered Sep 29 '22 by Crowley9

GCC 9 Binutils 2.33 experiment to show that LTO can inline

For those who are curious whether ld inlines across object files, here is a quick experiment confirming that it can:

main.c

int notmain(void);

int main(void) {
    return notmain();
}

notmain.c

int notmain(void) {
    return 42;
}

Compile with LTO and disassemble:

gcc -O3 -flto -ggdb3 -std=c99 -Wall -Wextra -pedantic -c -o main.o main.c
gcc -O3 -flto -ggdb3 -std=c99 -Wall -Wextra -pedantic -c -o notmain.o notmain.c
gcc -O3 -flto -ggdb3 -std=c99 -Wall -Wextra -pedantic -o main.out notmain.o main.o
gdb -batch -ex "disassemble/rs main" main.out

Disassembly output:

   0x0000000000001040 <+0>:     b8 2a 00 00 00  mov    $0x2a,%eax
   0x0000000000001045 <+5>:     c3      retq 

so we see that there is no callq or other jumps, which means that the call was inlined across the two object files.

Without -flto however we see:

   0x0000000000001040 <+0>:     f3 0f 1e fa     endbr64 
   0x0000000000001044 <+4>:     e9 f7 00 00 00  jmpq   0x1140 <notmain>

so now there is a JMPQ, which means that the call was not inlined.

Note that, as an optimization, the compiler chose a JMPQ, which does not make any stack changes as a more naive CALLQ would; I think this is a trivial minimal case of tail call optimization.

So yes, if you are using -flto, you don't need to worry about putting definitions in headers so they can be inlined.

The main downside of having definitions in headers is that they may slow down compilation. For C++ templates, you may also be interested in explicit template instantiation: Explicit template instantiation - when is it used?

Tested in Ubuntu 19.10 amd64.