
Explanation for GCC compiler optimisation's adverse performance effect?

Please note: this question is neither about code quality and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC, and about which compiler optimisation costs performance and why.

The program

The following code counts the number of Fibonacci primes up to m:

int main() {
  unsigned int m = 500000000u;
  unsigned int i = 0u;
  unsigned int a = 1u; 
  unsigned int b = 1u; 
  unsigned int c = 1u;
  unsigned int count = 0u;

  while (a + b <= m) {              // a + b is the current Fibonacci number
    for (i = 2u; i < a + b; ++i) {  // trial-division primality test
      c = (a + b) % i;

      if (c == 0u) {
        i = a + b;                  // divisor found: force the loop to terminate
        // break;
      }
    }

    if (c != 0u) {
      count = count + 1u;           // no divisor found: a + b is prime
    }

    a = a + b;                      // advance to the next Fibonacci number
    b = a - b;
  }

  return count; // Just to "output" (and thus use) count
}

When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.

I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.

Additional observations

  1. An explicit break in the innermost if makes the -O1 code fast again:

    if (c == 0u) {
      i = a + b; // Not actually needed any more
      break;
    }
    
  2. As does "inlining" the loop's progress expression:

    for (i = 2u; i < a + b; ) { // No ++i any more
      c = (a + b) % i;
    
      if (c == 0u) {
        i = a + b;
        ++i;
      } else {
        ++i;
      }
    }
    

Questions

  1. Which optimisation does/could explain the performance drop?

  2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?

  3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?

Asked Mar 02 '23 by Malte Schwerhoff


1 Answer

The important thing at play here is loop-carried data dependencies.

Look at the machine code of the slow variant of the innermost loop. I'm showing the -O2 assembly here; the -O1 code is less optimized, but has similar data dependencies overall:

.L4:
        xorl    %edx, %edx      # edx = 0 (upper half of the dividend for divl)
        movl    %esi, %eax      # eax = a + b (esi holds a + b throughout the loop)
        divl    %ecx            # edx = (a + b) % i (ecx holds i)
        testl   %edx, %edx      # is the remainder zero?
        cmove   %esi, %ecx      # if so: i = a + b (the if-converted assignment)
        addl    $1, %ecx        # ++i
        cmpl    %ecx, %esi      # loop while i < a + b
        ja      .L4

See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of the loop counter.

Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.

Adjusting the program to count the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to the latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
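
The answer doesn't show the adjusted program; a minimal sketch of one way to count the divisions (my own instrumentation, simply incrementing a counter next to the % operation) could look like this:

    #include <cstdio>

    int main() {
      unsigned int m = 500000000u;
      unsigned int a = 1u, b = 1u, c = 1u, count = 0u;
      unsigned long long divisions = 0u;   // added: counts executed divisions

      while (a + b <= m) {
        for (unsigned int i = 2u; i < a + b; ++i) {
          c = (a + b) % i;
          ++divisions;                     // one div per evaluation of %
          if (c == 0u) {
            i = a + b;
          }
        }
        if (c != 0u) {
          count = count + 1u;
        }
        a = a + b;
        b = a - b;
      }

      std::printf("%u Fibonacci primes, %llu divisions\n", count, divisions);
      return 0;
    }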

In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
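
To see the latency/throughput distinction in isolation, here is a rough stand-alone micro-benchmark (my own sketch, not part of the answer; it assumes an x86-like core where integer division is the dominant cost): the first loop chains each division's input to the previous result, the second performs the same number of independent divisions.

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    int main() {
      const std::uint64_t n = 50000000u;   // divisions per loop

      volatile std::uint64_t dv = 3u;      // volatile so the divisor is not a compile-time constant
      const std::uint64_t d = dv;

      // Latency-bound: the next dividend is the previous quotient.
      auto t0 = std::chrono::steady_clock::now();
      std::uint64_t x = 0xFFFFFFFFFFFFull;
      for (std::uint64_t i = 0; i < n; ++i) {
        x = x / d + 0xFFFFFFFFFFFFull;     // true data dependency between iterations
      }
      auto t1 = std::chrono::steady_clock::now();

      // Throughput-bound: every division gets a fresh, independent input.
      std::uint64_t sum = 0;
      for (std::uint64_t i = 1; i <= n; ++i) {
        sum += 0xFFFFFFFFFFFFull / i;      // divisions do not feed each other
      }
      auto t2 = std::chrono::steady_clock::now();

      std::cout << "dependent:   " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
      std::cout << "independent: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
      return (x + sum) != 0;               // use the results so they are not optimized away
    }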

Finally, to give specific answers to your questions:

1. Which optimisation does/could explain the performance drop?

As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.

2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?

Perhaps you can imagine the optimization transforming the code to

    for (i = 2u; i < a + b; ++i) {
      c = (a + b) % i;
      i = (c != 0) ? i : a + b; 
    }

Here the ternary operator is evaluated on the CPU (as a conditional move), so the new value of i is not known until c has been computed.

3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?

In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
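
For example, in the variant with the explicit break (repeated below with my annotations, not the answer's code), the only value carried from one iteration to the next is i itself, and the division result feeds a branch rather than the next i:

    for (i = 2u; i < a + b; ++i) {   // i depends only on its own previous value
      c = (a + b) % i;               // the remainder feeds a (well-predicted) branch,
      if (c == 0u) {                 // not the next value of i, so successive divisions
        break;                       // are independent and can overlap in the pipeline
      }
    }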

Answered Mar 04 '23 by amonakov