I'm currently experiencing some weird effect with gcc
(tested version: 4.8.4).
I've got a performance oriented code, which runs pretty fast. Its speed depends for a large part on inlining many small functions.
Since inlining across multiple .c
files is difficult (-flto
is not yet widely available), I've kept a lot of small functions (typically 1 to 5 lines of code each) into a common C file, into which I'm developing a codec, and its associated decoder. It's "relatively" large by my standard (about ~2000 lines, although a lot of them are just comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that, if that is possible.
Encoder and Decoder are related, since they are inverse operations. But from a programming perspective, they are completely separated, sharing nothing in common, except a few typedef and very low-level functions (such as reading from unaligned memory position).
The strange effect is this one:
I recently added a new function fnew
to the encoder side. It's a new "entry point". It's not used nor called from anywhere within the .c
file.
The simple fact that it exists makes the performance of the decoder function fdec
drops substantially, by more than 20%, which is way too much to be ignored.
Now, keep in mind than encoding and decoding operations are completely separated, and share almost nothing, save some minor typedef
(u32
, u16
and such) and associated operations (read/write).
When defining the new encoding function fnew
as static
, performance of the decoder fdec
increases back to normal. Since fnew
isn't called from the .c
, I guess it's the same as if it was not there (dead code elimination).
If static fnew
is now called from the encoder side, performance of fdec
remains strong.
But as soon as fnew
is modified, fdec
performance just drops substantially.
Presuming fnew
modifications crossed a threshold, I increased the following gcc
parameter: --param max-inline-insns-auto=60
(by default, its value is supposed to be 40.) And it worked : performance of fdec
is now back to normal.
And I guess this game will continue forever with each little modification of fnew
or anything else similar, requiring further tweak.
This is just plain weird. There is no logical reason for some little modification in function fnew
to have knock-on effect on completely unrelated function fdec
, which only relation is to be in the same file.
The only tentative explanation I could invent so far is that maybe the simple presence of fnew
is enough to cross some kind of global file threshold
which would impact fdec
. fnew
can be made "not present" when it's: 1. not there, 2. static
but not called from anywhere 3. static
and small enough to be inlined. But it's just hiding the problem. Does that mean I can't add any new function?
Really, I couldn't find any satisfying explanation anywhere on the net.
I was curious to know if someone already experienced some equivalent side-effect, and found a solution to it.
[Edit]
Let's go for some more crazy test.
Now I'm adding another completely useless function, just to play with. Its content is strictly exactly a copy-paste of fnew
, but the name of the function is obviously different, so let's call it wtf
.
When wtf
exists, it doesn't matter if fnew
is static or not, nor what is the value of max-inline-insns-auto
: performance of fdec
is back to normal.
Even though wtf
is not used nor called from anywhere... :'(
[Edit 2]
there is no inline
instruction. All functions are either normal or static
. Inlining decision is solely within compiler's realm, which has worked fine so far.
[Edit 3]
As suggested by Peter Cordes, the issue is not related to inline, but to instruction alignment. On newer Intel cpus (Sandy Bridge and later), hot loop benefit from being aligned on 32-bytes boundaries.
Problem is, by default, gcc
align them on 16-bytes boundaries. Which gives a 50% chance to be on proper alignment depending on length of previous code. Hence a difficult to understand issue, which "looks random".
Not all loop are sensitive. It only matters for critical loops, and only if their length make them cross one more 32-bytes instruction segment when being less ideally aligned.
inline functions might make it faster: As shown above, procedural integration might remove a bunch of unnecessary instructions, which might make things run faster. inline functions might make it slower: Too much inlining might cause code bloat, which might cause “thrashing” on demand-paged virtual-memory systems.
GCC automatically inlines member functions defined within the class body of C++ programs even if they are not explicitly declared inline . (You can override this with -fno-default-inline ; see Options Controlling C++ Dialect.)
Inline functions behave like macros. When an inline function gets called, instead of transferring the control to the function, the call gets substituted with the function code. Thus this saves time and improves performance.
Inline function expansion can speed up execution by eliminating function call overhead. This is particularly beneficial for very small functions that are called frequently. Function inlining involves a tradeoff between execution speed and code size, because the code is duplicated at each function call site.
Turning my comments into an answer, because it was turning into a long discussion. Discussion showed that the performance problem is sensitive to alignment.
There are links to some perf-tuning info at https://stackoverflow.com/tags/x86/info, include Intel's optimization guide, and Agner Fog's very excellent stuff. Some of Agner Fog's assembly optimization advice doesn't fully apply to Sandybridge and later CPUs. If you want the low-level details on a specific CPU, though, the microarch guide is very good.
Without at least an external link to code that I can try myself, I can't do more than handwave. If you don't post the code anywher, you're going to need to use profiling / CPU performance counter tools like Linux perf
or Intel VTune to track this down in a reasonable amount of time.
In chat, the OP found someone else having this issue, but with code posted. This is probably the same issue the OP is seeing, and is one of the major ways code alignment matters for Sandybridge-style uop caches.
There's a 32B boundary in the middle of the loop in the slow version. The instructions that start before the boundary decode to 5 uops. So in the first cycle, the uop cache serves up mov/add/movzbl/mov
. In the 2nd cycle, there's only a single mov
uop left in the current cache line. Then the 3rd cycle cycle issues the last 2 uops of the loop: add
and cmp+ja
.
The problematic mov
starts at 0x..ff
. I guess instructions that span a 32B boundary go into (one of) the uop cacheline(s) for their starting address.
In the fast version, an iteration only takes 2 cycles to issue: The same first cycle, then mov / add / cmp+ja
in the 2nd.
If one of the first 4 instructions had been one byte longer (e.g. padded with a useless prefix, or a REX prefix), there would be no problem. There wouldn't be an odd-man-out at the end of the first cacheline, because the mov
would start after the 32B boundary and be part of the next uop cache line.
AFAIK, assemble & check disassembly output is the only way to use longer versions of the same instructions (see Agner Fog's Optimizing Assembly) to get 32B boundaries at multiples of 4 uops. I'm not aware of a GUI that shows alignment of assembled code as you're editing. (And obviously, doing this only works for hand-written asm, and is brittle. Changing the code at all will break the hand-alignment.)
This is why Intel's optimization guide recommends aligning critical loops to 32B.
It would be really cool if an assembler had a way to request that preceding instructions be assembled using longer encodings to pad out to a certain length. Maybe a .startencodealign
/ .endencodealign 32
pair of directives, to apply padding to code between the directives to make it end on a 32B boundary. This could make terrible code if used badly, though.
Changes to the inlining parameter will change the size of functions, and bump other code over by multiples 16B. This is a similar effect to changing the contents of a function: it gets bigger and changes the alignment of other functions.
I was expecting the compiler to always make sure a function starts at ideal aligned position, using noop to fill gaps.
There's a tradeoff. It would hurt performance to align every function to 64B (the start of a cache line). Code density would go down, with more cache lines needed to hold the instructions. 16B is good, because it's the instruction fetch/decode chunk size on most recent CPUs.
Agner Fog has the low-level details for each microarch. He hasn't updated it for Broadwell, though, but the uop cache probably hasn't changed since Sandybridge. I assume there's one fairly small loop that dominates the runtime. I'm not sure exactly what to look for first. Maybe the "slow" version has some branch targets near the end of a 32B block of code (and hence near the end of a uop cacheline), leading to significantly less than 4 uops per clock coming out of the frontend.
Look at performance counters for the "slow" and "fast" versions (e.g. with perf stat ./cmd
), and see if any are different. e.g. a lot more cache misses could indicate false sharing of a cache line between threads. Also, profile and see if there's a new hotspot in the "slow" version. (e.g. with perf record ./cmd && perf report
on Linux).
How many uops/clock is the "fast" version getting? If it's above 3, frontend bottlenecks (maybe in the uop cache) that are sensitive to alignment could be the issue. Either that or L1 / uop-cache misses if different alignment means your code needs more cache lines than are available.
Anyway, this bears repeating: use a profiler / performance counters to find the new bottleneck that the "slow" version has, but the "fast" version doesn't. Then you can spend time looking at the disassembly of that block of code. (Don't look at gcc's asm output. You need to see the alignment in the disassembly of the final binary.) Look at the 16B and 32B boundaries, since presumably they'll be in different places between the two versions, and we think that's the cause of the problem.
Alignment can also make macro-fusion fail, if a compare/jcc splits a 16B boundary exactly. Although that is unlikely in your case, since your functions are always aligned to some multiple of 16B.
re: automated tools for alignment: no, I'm not aware of anything that can look at a binary and tell you anything useful about alignment. I wish there was an editor to show groups of 4 uops and 32B boundaries alongside your code, and update as you edit.
Intel's IACA can sometimes be useful for analyzing a loop, but IIRC it doesn't know about taken branches, and I think doesn't have a sophisticated model of the frontend, which is obviously the issue if misalignment breaks performance for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With