I have an array of POD structs and am trying to sum across one field. Here's a minimal example:
struct Item
{
int x = 0;
int y = 0;
};
typedef Item Items[2];
struct ItemArray
{
Items items;
int sum_x1() const;
int sum_x2() const;
};
int ItemArray::sum_x1() const
{
int total = 0;
for (unsigned ii = 0; ii < 2; ++ii)
{
total += items[ii].x;
}
return total;
}
int ItemArray::sum_x2() const
{
int total = 0;
for (const Item& item : items)
{
total += item.x;
}
return total;
}
The two sum functions do the same thing. Clang compiles them identically. But GCC 6 with -O3
on x86_64 does not. Here's sum_x1()
, looking good:
mov eax, DWORD PTR [rdi+8]
add eax, DWORD PTR [rdi]
ret
Now look at sum_x2()
:
lea rdx, [rdi+16]
lea rcx, [rdi+8]
xor eax, eax
add eax, DWORD PTR [rdi]
cmp rdx, rcx
je .L12
lea rcx, [rdi+16]
add eax, DWORD PTR [rdi+8]
cmp rdx, rcx
je .L2
lea rcx, [rdi+24]
add eax, DWORD PTR [rdi+16]
cmp rdx, rcx
je .L2
lea rcx, [rdi+32]
add eax, DWORD PTR [rdi+24]
cmp rdx, rcx
je .L2
lea rcx, [rdi+40]
add eax, DWORD PTR [rdi+32]
cmp rdx, rcx
je .L2
lea rcx, [rdi+48]
add eax, DWORD PTR [rdi+40]
cmp rdx, rcx
je .L2
lea rcx, [rdi+56]
add eax, DWORD PTR [rdi+48]
cmp rdx, rcx
je .L2
lea rcx, [rdi+64]
add eax, DWORD PTR [rdi+56]
cmp rdx, rcx
je .L2
lea rcx, [rdi+72]
add eax, DWORD PTR [rdi+64]
cmp rdx, rcx
je .L2
add eax, DWORD PTR [rdi+72]
ret
.L2:
rep ret
.L12:
rep ret
Why does GCC emit an unrolled loop of variable length up to 10 when there are the loop length is fixed at 2? It only does this in a member function--making sum_x2
a free function fixes it.
ICC also optimizes sum_x2()
very strangely, though the generated code is totally different. Unlike GCC, it doesn't matter whether sum_x2()
is a member function or a free function--both are bad.
I'm using GCC 6, but all versions of GCC seem to have problems with this code. Adding -march=haswell
makes it even worse, adding iterations for up to 15 elements in the array of size 2. GCC 5 and 7 generate even more complex code, adding SIMD instructions.
I would like to identify the exact cause of this problem, so that I can locate and fix similar occurrences in my code. Understanding what triggers this behavior in GCC 6 would be very helpful. I have lots of range-based for loops in my code, and I am not too excited at the prospect of removing them, but if GCC can't generate reasonable code I will have no choice.
Try it: https://godbolt.org/g/9GK4jy
More related insanity: https://godbolt.org/g/BGYggD (optimal code is 3 instructions; GCC 6 produces 8 instructions; GCC 7 produces 130 instructions)
As described by Richard Biener in my bug report, the problem seems to be that GCC prior to version 8 failed to understand that a field of a class or struct was subject to the same optimizations (e.g. constant loop count) as a regular variable. So it would emit all sorts of fancy code to optimally loop an unknown number of times, even when it was known at compile time, in the case where the container was a member variable.
The way I understand it, this bug probably affects quite a bit of code in the wild--e.g. anywhere a member small array is the subject of a C++11 range-based for loop.
Thanks to Richard Biener for the prompt resolution (targeted for GCC 8).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With