Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GCC optimizes fixed range-based for loop as if it had longer, variable length

I have an array of POD structs and am trying to sum across one field. Here's a minimal example:

struct Item
{
    int x = 0;
    int y = 0;
};

typedef Item Items[2];

struct ItemArray
{
    Items items;

    int sum_x1() const;
    int sum_x2() const;
};

int ItemArray::sum_x1() const
{
    int total = 0;
    for (unsigned ii = 0; ii < 2; ++ii)
    {
        total += items[ii].x;
    }
    return total;
}

int ItemArray::sum_x2() const
{
    int total = 0;
    for (const Item& item : items)
    {
        total += item.x;
    }
    return total;
}

The two sum functions do the same thing. Clang compiles them identically. But GCC 6 with -O3 on x86_64 does not. Here's sum_x1(), looking good:

  mov eax, DWORD PTR [rdi+8]
  add eax, DWORD PTR [rdi]
  ret

Now look at sum_x2():

  lea rdx, [rdi+16]
  lea rcx, [rdi+8]
  xor eax, eax
  add eax, DWORD PTR [rdi]
  cmp rdx, rcx
  je .L12
  lea rcx, [rdi+16]
  add eax, DWORD PTR [rdi+8]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+24]
  add eax, DWORD PTR [rdi+16]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+32]
  add eax, DWORD PTR [rdi+24]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+40]
  add eax, DWORD PTR [rdi+32]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+48]
  add eax, DWORD PTR [rdi+40]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+56]
  add eax, DWORD PTR [rdi+48]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+64]
  add eax, DWORD PTR [rdi+56]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+72]
  add eax, DWORD PTR [rdi+64]
  cmp rdx, rcx
  je .L2
  add eax, DWORD PTR [rdi+72]
  ret
.L2:
  rep ret
.L12:
  rep ret

Why does GCC emit an unrolled loop of variable length up to 10 when there are the loop length is fixed at 2? It only does this in a member function--making sum_x2 a free function fixes it.

ICC also optimizes sum_x2() very strangely, though the generated code is totally different. Unlike GCC, it doesn't matter whether sum_x2() is a member function or a free function--both are bad.

I'm using GCC 6, but all versions of GCC seem to have problems with this code. Adding -march=haswell makes it even worse, adding iterations for up to 15 elements in the array of size 2. GCC 5 and 7 generate even more complex code, adding SIMD instructions.

I would like to identify the exact cause of this problem, so that I can locate and fix similar occurrences in my code. Understanding what triggers this behavior in GCC 6 would be very helpful. I have lots of range-based for loops in my code, and I am not too excited at the prospect of removing them, but if GCC can't generate reasonable code I will have no choice.

Try it: https://godbolt.org/g/9GK4jy

More related insanity: https://godbolt.org/g/BGYggD (optimal code is 3 instructions; GCC 6 produces 8 instructions; GCC 7 produces 130 instructions)

like image 438
John Zwinck Avatar asked Aug 04 '17 01:08

John Zwinck


1 Answers

As described by Richard Biener in my bug report, the problem seems to be that GCC prior to version 8 failed to understand that a field of a class or struct was subject to the same optimizations (e.g. constant loop count) as a regular variable. So it would emit all sorts of fancy code to optimally loop an unknown number of times, even when it was known at compile time, in the case where the container was a member variable.

The way I understand it, this bug probably affects quite a bit of code in the wild--e.g. anywhere a member small array is the subject of a C++11 range-based for loop.

Thanks to Richard Biener for the prompt resolution (targeted for GCC 8).

like image 164
John Zwinck Avatar answered Jan 05 '23 01:01

John Zwinck