Please note: this question is neither about code quality and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC, which compiler optimisation costs performance, and why.
The following code counts the number of Fibonacci primes up to m:
int main() {
    unsigned int m = 500000000u;
    unsigned int i = 0u;
    unsigned int a = 1u;
    unsigned int b = 1u;
    unsigned int c = 1u;
    unsigned int count = 0u;
    while (a + b <= m) {
        for (i = 2u; i < a + b; ++i) {
            c = (a + b) % i;
            if (c == 0u) {
                i = a + b;
                // break;
            }
        }
        if (c != 0u) {
            count = count + 1u;
        }
        a = a + b;
        b = a - b;
    }
    return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
An explicit break in the innermost if makes the -O1 code fast again:
if (c == 0u) {
    i = a + b; // Not actually needed any more
    break;
}
As does "inlining" the loop's progress expression:
for (i = 2u; i < a + b; ) { // No ++i any more
    c = (a + b) % i;
    if (c == 0u) {
        i = a + b;
        ++i;
    } else {
        ++i;
    }
}
Which optimisation does/could explain the performance drop?
Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
The important thing at play here is loop-carried data dependencies.
Look at the machine code of the slow variant's innermost loop. I'm showing -O2 assembly here; -O1 is less optimised, but has similar data dependencies overall:
.L4:
    xorl  %edx, %edx        # clear the high half of the dividend
    movl  %esi, %eax        # dividend = a + b
    divl  %ecx              # remainder (a + b) % i lands in %edx
    testl %edx, %edx        # is the remainder zero?
    cmove %esi, %ecx        # if so, i = a + b (conditional move, no branch)
    addl  $1, %ecx          # ++i
    cmpl  %ecx, %esi
    ja    .L4               # loop while i < a + b
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of the loop counter.
Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain determines the execution time of the program.
Adjusting the program to count the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to the latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
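For reference, here is a minimal sketch of how such a count can be obtained (my own instrumented variant, not necessarily the exact program used for the measurement above): a counter is bumped once per modulo operation and printed at the end.

#include <cstdio>

// Hypothetical instrumented variant, used only to count how many times the
// modulo (and hence the div instruction) is executed.
int main() {
    unsigned int m = 500000000u;
    unsigned int a = 1u, b = 1u, c = 1u;
    unsigned long long divisions = 0ull;
    while (a + b <= m) {
        for (unsigned int i = 2u; i < a + b; ++i) {
            c = (a + b) % i;
            ++divisions;               // one div per inner-loop iteration
            if (c == 0u) {
                i = a + b;
            }
        }
        a = a + b;
        b = a - b;
    }
    std::printf("%llu\n", divisions);  // should match the 434044698 reported above
    return 0;
}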
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
for (i = 2u; i < a + b; ++i) {
    c = (a + b) % i;
    i = (c != 0) ? i : a + b;
}
Where the ternary operator is evaluated on the CPU such that the new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
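As a rough way to see this for yourself (a minimal sketch of my own, with made-up function names; the exact output depends on the GCC version and flags), compiling the following two trial-division loops at -O1 should show a cmov in the first inner loop but a plain conditional branch in the second:

// trial_cmov: the conditional only selects a new value for i, so the compiler
// may if-convert it into a cmov, creating the loop-carried dependency.
unsigned int trial_cmov(unsigned int n) {
    unsigned int c = 1u;
    for (unsigned int i = 2u; i < n; ++i) {
        c = n % i;
        if (c == 0u) {
            i = n;              // value selection: a candidate for cmov
        }
    }
    return c;
}

// trial_break: the conditional changes control flow (leaves the loop), so it
// stays a branch and no data dependency on i is introduced.
unsigned int trial_break(unsigned int n) {
    unsigned int c = 1u;
    for (unsigned int i = 2u; i < n; ++i) {
        c = n % i;
        if (c == 0u) {
            break;
        }
    }
    return c;
}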