Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does GCC emit a repeated `ret`?

Tags:

c++

x86

gcc

clang

In the C++ example below, I define a function that computes the height of the leftmost path of a tree.

struct TreeNode {
  int value{};
  TreeNode *l = nullptr;
  TreeNode *r = nullptr;
};

int getLeftHeight(TreeNode *root) {
  int height = 0;
  while (root != nullptr) {
    root = root->l;
    height++;
  }
  return height;
}

When compiling with GCC 9.3 or 10.1 (using -O3), I get the following x86.

getLeftHeight(TreeNode*):
        xor     eax, eax
        test    rdi, rdi
        je      .L4
.L3:
        mov     rdi, QWORD PTR [rdi+8]
        add     eax, 1
        test    rdi, rdi
        jne     .L3
        ret
.L4:
        ret

However, when compiling using Clang 10, as shown below, there is not a repeated ret.

getLeftHeight(TreeNode*):
        xor     eax, eax
        test    rdi, rdi
        je      .LBB0_3
.LBB0_1:
        mov     rdi, qword ptr [rdi + 8]
        add     eax, 1
        test    rdi, rdi
        jne     .LBB0_1
.LBB0_3:
        ret

Why is GCC doing this?

As far as I understand, the ret will take one extra byte to be encoded. Maybe this doesn't matter in this case because of alignment/padding of the emitted machine code, but if this is not the case and this duplication happens all over the place in a large codebase, many bytes of machine code could end up being just duplicate rets.

This particular case of duplication also seems fairly simple to spot and optimize away.

like image 556
Bernardo Sulzbach Avatar asked Jun 24 '20 13:06

Bernardo Sulzbach


1 Answers

This seems to be a missed optimization in GCC.

There is a bug #71923 with a very similar example tagged "missed-optimization".

Having said that, the impact on performance is marginal - mainly on code size. In addition, when the second ret is aligned (which will often be the case with modern x86 targets), eliminating the first ret would often have no impact on code size, but if eliminated, falling through from the loop would incur decoding the additional nop instruction before ret, which can make that code path slower.

Anyway, a pull request to GCC is always welcome. You can see the list of optimization passes with -fdump-passes and dump the RTL/GIMPLE tree using -fdump-tree-... flags (or see it on Godbolt). The main question is if an existing optimization pass is supposed to take care of this or if a new one needs to be added (I'm not an expert on GCC, so can't advise one way or another).

like image 85
rustyx Avatar answered Nov 12 '22 11:11

rustyx