I am seeing very weird compiler behavior: G++ pulls computations into a hot loop, severely reducing the performance of the resulting code. What is going on here?
Consider this function:
#include <cstdint>

constexpr bool noLambda = true;

void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset,
               int32_t iter, int32_t limit, int32_t* writer, const int32_t* dictPtr2) {
    // Computation X1
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    // Computation X2
    const uint16_t* data = (const uint16_t*)(columnData + dataOffset);

    // 1. The less broken solution without lambda
    if (noLambda) {
        for (; iter != limit; ++iter) {
            int32_t t = dictPtr[data[iter]];
            *writer = t;
            writer++;
        }
    }
    // 2. The totally broken solution with lambda
    else {
        auto loop = [=](auto body) mutable { for (; iter != limit; ++iter) { body(iter); } };
        loop([=](unsigned index) mutable {
            int32_t t = dictPtr[data[index]];
            *writer = t;
            writer++;
        });
    }
}
The problem here is that G++ somehow loves to pull Computations X1 and X2 into the hot main loop, reducing the performance. Here are the details:
The function simply iterates over an array data, looks up a value in a dictionary dictPtr, and writes it to a target memory location writer. data and dictPtr are computed at the beginning of the function (Computations X2 and X1, respectively). The function has two flavors for doing so: one with a lambda, one without.
(Note that this function is just a minimal working example of much more complicated code. So please refrain from commenting that the lambda is unnecessary here. I am aware of this fact and in the original code it is necessary, unfortunately.)
The problem when compiling with the latest g++ (tried 8.1 and 7.2; the same problem occurs with older g++ versions, as you can see in the godbolt links provided) at a high optimization level (-O3 -std=c++14) is the following:

Solution 2 (noLambda=false) generates very bad code for the loop, even worse than the "naive" solution, because the compiler assumes it is a good idea to pull Computations X1 and X2, which live outside of the super hot main loop, into it, making the loop around 25% slower on my CPU.
https://godbolt.org/g/MzbxPN
.L3:
movl %ecx, %eax # unnecessary extra work
addl $1, %ecx
addq $4, %r9 # separate loop counter (pointer increment)
leaq (%rdi,%rax,2), %rax # array indexing with an LEA
movzwl (%rax,%rsi), %eax # rax+rsi is Computation X2, pulled into the loop!
leaq (%rdi,%rax,4), %rax # rax+rdx is Computation X1, pulled into the loop!
movl (%rax,%rdx), %eax
movl %eax, -4(%r9)
cmpl %ecx, %r8d
jne .L3
When using a usual for loop (noLambda=true), the code is better: X2 is no longer pulled into the loop, but X1 still is!
https://godbolt.org/g/eVG75m
.L3:
movzwl (%rsi,%rax,2), %ecx
leaq (%rdi,%rcx,4), %rcx
movl (%rcx,%rdx), %ecx # This is Computation X1, pulled into the loop!
movl %ecx, (%r9,%rax,4)
addq $1, %rax
cmpq %rax, %r8
jne .L3
You can verify that it really is X1 in the loop by replacing dictPtr (the result of Computation X1) in the loop with dictPtr2 (a parameter); the instruction vanishes:
https://godbolt.org/g/nZ7TjJ
.L3:
movzwl (%rdi,%rax,2), %ecx
movl (%r10,%rcx,4), %ecx
movl %ecx, (%r9,%rax,4)
addq $1, %rax
cmpq %rax, %rdx
jne .L3
This is finally the loop as I want to have it: a simple loop that loads the values and stores the result without pulling random computations into it.
So, what is going on here? It is seldom a good idea to pull a computation into a hot loop, but G++ seems to think so here. This is costing me real performance. The lambda exacerbates the whole situation; it leads G++ to pull even more computations into the loop.
What makes this problem so severe is that this is really trivial C++ code without fancy features. If I cannot rely on my compiler producing perfect code for such a trivial example, I will need to check the assembly of all hot loops in my code to make sure everything is as fast as it could be. This also means that there are probably a huge number of programs affected by this.
You are using an unsigned 32-bit type for the array index (the unsigned index parameter of the inner lambda). This forces the compiler to consider at each step through the loop whether you might have wrapped around its available range, in which case it would need to go back to the beginning of the array. The extra code you see is related to this check! The way to avoid this over-cautious approach by the compiler is to use index variables that are as wide as a pointer.

You aren't complaining about the code before the loop starts, but it has the same problem. Just make iter and limit int64_t, and you'll see the loop gets considerably shorter, as the compiler no longer has to consider the possibility of the index wrapping around.

So, to recap: it is not the calculation of X1 and X2 being moved into the loop that makes the code balloon, but the use of an incorrectly-typed array index variable.
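For illustration, here is a minimal sketch of that change (a hypothetical variant, not the asker's actual code; the unused dictPtr2 parameter is dropped for brevity, and the only other difference from the original is that the index variables are 64-bit):

#include <cstdint>

// Sketch: same loop, but iter/limit and the lambda's index parameter are
// pointer-width, so the compiler never has to model 32-bit wrap-around.
void funnyEval64(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset,
                 int64_t iter, int64_t limit, int32_t* writer) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);
    auto loop = [=](auto body) mutable { for (; iter != limit; ++iter) { body(iter); } };
    loop([=](int64_t index) mutable {
        *writer = dictPtr[data[index]];   // dictionary lookup and store, as before
        writer++;
    });
}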
Congratulations, you found a gcc bug. The main solution is to report it on GCC's bugzilla with the "missed-optimization" keyword. Your MCVEs are already great test-cases for the bug, so it shouldn't take too long to write one up. Copy/paste the code and some description. A link to this Q&A, and a link to the code on http://godbolt.org/, would be good, too.
Sometimes there are subtle microarchitectural reasons to use "extra" instructions, like xor-zeroing the destination of popcnt/lzcnt or bsf to avoid a false dependency on Intel CPUs, but that's not the case here. This is just bad; the movl %ecx, %eax inside the loop could be the result of using an unsigned type narrower than a pointer, but even that could be done more efficiently. It's a missed optimization, too.
I haven't looked at GCC's GIMPLE or RTL dumps (internal representations) to find out more details. The only uses of the calculated values are inside the loop, so I can imagine that the compiler's internal representation of the program logic may have lost the difference between inside and outside the loop when transforming. Normally, things that don't need to be in the loop are hoisted or sunk out of it.
But unfortunately it's not rare for gcc to leave an extra mov instruction inside a loop to set up for code outside the loop, especially when it might take multiple instructions outside the loop to get the same effect. This is usually a bad tradeoff when optimizing for performance instead of code-size. I haven't looked at asm output from Profile-Guided Optimization as much as I'd like, to see code where gcc knows which loops are really hot and unrolls them. But most code gets built without PGO, unfortunately, so code-gen without -fprofile-use is still very important.
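For reference, a minimal PGO build with gcc looks roughly like this (a generic sketch; test.cpp and the training run are placeholders, not files from this question):

g++ -O3 -std=c++14 -fprofile-generate test.cpp -o test
./test                                            # run a representative workload to record the profile
g++ -O3 -std=c++14 -fprofile-use test.cpp -o test # rebuild using the recorded profile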
However, the core of this question is not how to get this particular example as fast as possible. Instead, I am rather astonished at how the compiler can produce such deoptimizations in such a simple snippet of code. My main problem now is: I have somehow lost faith in my compiler, so I want to understand how this can happen so I can regain it.
Don't have faith in gcc! It's a very complex piece of machinery that often produces good results, but rarely produces optimal results.
This case is one of the most obvious and simple wrong choices I've seen its optimizer make (and is pretty disappointing), though. Usually the missed optimizations are somewhat more subtle (and hinge on microarchitectural details like addressing mode choices and uops / execution ports), or at least not so obviously trivial-looking to avoid. (Hoist that one instruction without changing any register allocation for the whole loop.)
But many loops bottleneck on memory, not uop throughput. Modern CPUs are designed to chew through the wasted instructions that compilers, especially JIT compilers, generate. This is why missed optimizations like this usually don't have a big impact at the macro scale, and why the cases where they do matter (e.g. video encoders or matrix multiplies) often still use blocks of hand-written asm.
Often it's possible to hand-hold gcc into making nice asm by implementing your source in a way that's structured like the asm you want. (Like this case: What is the efficient way to count set bits at a position or lower?, and see Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?, for a more general answer about helping the compiler vs. beating the compiler with hand written asm.)
But when your compiler is being braindead like this, there's nothing you can do. Well, except look for workarounds, or stuff like avoiding unsigned integers narrower than pointers, which some other answers have pointed out as being important.
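One such workaround sketch (an assumption on my part, not something verified against this exact gcc version): rewrite the hot loop over raw pointers, so there is no narrow integer index left from which the compiler has to re-derive the addresses. The unused dictPtr2 parameter is omitted for brevity.

#include <cstdint>

// Sketch of a pointer-based rewrite of the hot loop.
void funnyEvalPtr(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset,
                  int32_t iter, int32_t limit, int32_t* writer) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* cur = reinterpret_cast<const uint16_t*>(columnData + dataOffset) + iter;
    const uint16_t* end = reinterpret_cast<const uint16_t*>(columnData + dataOffset) + limit;
    for (; cur != end; ++cur) {
        *writer++ = dictPtr[*cur];   // load index, dictionary lookup, store
    }
}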
Interestingly, the worst case (2 extra LEA instructions in the loop, plus using extra loop counters) only happens with your if (noLambda). If you make 2 separate versions of the function and remove the if, the nolambda version makes a nice clean loop (but misses auto-vectorization of the gather, which would be a win when compiled with -march=skylake). A sketch of that split is shown below.
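This is roughly what "2 separate versions" means (hypothetical function names; each function contains only one branch of the original if, and the unused dictPtr2 parameter is dropped):

#include <cstdint>

// Plain-loop flavor only.
void funnyEvalPlain(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset,
                    int32_t iter, int32_t limit, int32_t* writer) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);
    for (; iter != limit; ++iter) {
        *writer++ = dictPtr[data[iter]];
    }
}

// Lambda flavor only.
void funnyEvalLambda(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset,
                     int32_t iter, int32_t limit, int32_t* writer) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);
    auto loop = [=](auto body) mutable { for (; iter != limit; ++iter) { body(iter); } };
    loop([=](unsigned index) mutable { *writer++ = dictPtr[data[index]]; });
}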
I put your code on the Godbolt compiler explorer. (Also interesting: use -funroll-loops to see which parts are redone for every unrolled loop iteration and which are simply inside the loop once.)
# gcc7.2: the nolamba side of the if, with no actual if()
.L3:
movzwl (%rsi,%rax,2), %ecx
movl (%rdx,%rcx,4), %ecx
movl %ecx, (%r9,%rax,4) # indexed store: no port 7
addq $1, %rax # gcc8 -O3 -march=skylake uses inc to save a code byte here.
cmpq %rax, %r8
jne .L3
On Intel Sandybridge-family, this decodes to 5 uops. (Macro-fusion of the cmp/jcc turns that pair into 1 uop. The other instructions are all single-uop; movzwl is a pure load and doesn't need an ALU port.)
The store un-laminates on SnB/IvB (costing an extra uop for the 4-wide issue stage, one of the main front-end bottlenecks), but can stay fused on HSW/SKL. It can't use port 7, though (because it's indexed), which means HSW/SKL is limited to 2 memory ops per clock, not 3.
Bottlenecks:
Front-end issue bandwidth of 4 fused-domain uops per clock. The loop is 5 uops, so it can issue at best 1 iteration per 1.25 clocks. (Non-multiple-of-4 loops aren't perfect, but 5 uops is handled well, at least on Haswell/Skylake; possibly not on Sandybridge.)
load / store execution ports: Haswell and later can run 2 loads + 1 store per clock, but only when the store avoids an indexed addressing mode so the store-address uop can run on port 7.
The lambda version gets a 2nd loop counter (pointer increment) and a silly movl %ecx, %eax, but the LEA instructions stay out of the loop.
But it's not the extra computation per se; it's the total uop throughput that's probably hurting your loop. If the dictionary mostly stays hot in cache, a Haswell or later CPU ...
I was going to write more, but I didn't finish. Posting now because the generic early/mid part is apparently what the question is really about. Don't blindly trust gcc.
And don't expect that it will make optimal code most of the time. You can often gain 10 or 20% just by tweaking the C source (and sometimes much more). Sometimes gcc just doesn't seem to have a clue, like using extra leas for no apparent reason when unrolling, instead of using a displacement in an addressing mode. I think its addressing-mode cost model must not be accurate, at least for -march=haswell / -march=skylake.
I tried running your code and... surprise: the instructions executed when you are in the loop are not the ones you see in the compiler explorer link you posted. Check this out (I added a main function): https://godbolt.org/g/PPYtQa The instructions executed while you are in the loop are 162-167, i.e.:
.L15:
movzwl 25(%rbx,%rdx), %ecx
movl 5(%rbx,%rcx,4), %ecx
movl %ecx, 0(%rbp,%rdx,2)
addq $2, %rdx
cmpq $180, %rdx
jne .L15
You can double-check this by compiling on your machine
g++ test.cpp -std=c++1z -g -O3
and running it under gdb:
> gdb a.out
(gdb) break funnyEval
(gdb) layout split # shows assembly
(gdb) stepi        # steps to the next instruction
The compiler generates a different, non-inlined version of funnyEval (which is the one you saw in the disassembled output), even though the one that is actually used is inlined. I have no idea (yet) why the two are different, but I guess that if you are hit by a performance penalty, you could fix it by making sure that funnyEval gets inlined: either by defining it in a header file or by compiling and linking with link-time optimization (-flto). I'll give it a try to see what happens when funnyEval is in another translation unit...
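A sketch of the header-file route (hypothetical file name; only the non-lambda branch of the body is shown for brevity, the point is simply that the definition is visible in the caller's translation unit so it can be inlined):

// funny_eval.h -- hypothetical header containing the definition
#pragma once
#include <cstdint>

inline void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset,
                      int32_t iter, int32_t limit, int32_t* writer, const int32_t* dictPtr2) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);
    for (; iter != limit; ++iter) {
        *writer++ = dictPtr[data[iter]];
    }
}

The alternative mentioned above is to keep funnyEval in its own .cpp file and compile and link everything with -flto, so the optimizer can still inline it across translation units.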