
Inlining of virtual functions (Clang vs GCC)

Take this stupid example:

class Base
{
  public:
    virtual void ant() { i++; };
    virtual void dec() { i--; };
  int i;
};

void function(Base * base) 
{ 
  base->ant(); 
  base->dec(); 
}

The way I imagined the compiler would implement this was as two virtual function calls. Clang does just that (using a tail call for the call to dec()):

function(Base*):                      # @function(Base*)
        push    rbx
        mov     rbx, rdi
        mov     rax, qword ptr [rbx]
        call    qword ptr [rax]
        mov     rax, qword ptr [rbx]
        mov     rdi, rbx
        pop     rbx
        jmp     qword ptr [rax + 8]     # TAILCALL

GCC, on the other hand, goes out of its way to partially inline the function calls:

Base::ant():
        add     DWORD PTR [rdi+8], 1      # this_2(D)->i,
        ret
Base::dec():
        sub     DWORD PTR [rdi+8], 1      # this_2(D)->i,
        ret
function(Base*):
        push    rbx     #
        mov     rax, QWORD PTR [rdi]      # _3, base_2(D)->_vptr.Base
        mov     rbx, rdi  # base, base
        mov     rdx, QWORD PTR [rax]      # _4, *_3
        cmp     rdx, OFFSET FLAT:Base::ant()      # _4,
        jne     .L4       #,
        mov     rax, QWORD PTR [rax+8]    # _7, MEM[(int (*__vtbl_ptr_type) () *)prephitmp_8 + 8B]
        add     DWORD PTR [rdi+8], 1      # base_2(D)->i,
        cmp     rax, OFFSET FLAT:Base::dec()      # _7,
        jne     .L6       #,
.L12:
        sub     DWORD PTR [rbx+8], 1      # base_2(D)->i,
        pop     rbx       #
        ret
.L4:
        call    rdx     # _4
        mov     rax, QWORD PTR [rbx]      # _3, base_2(D)->_vptr.Base
        mov     rax, QWORD PTR [rax+8]    # _7, MEM[(int (*__vtbl_ptr_type) () *)prephitmp_8 + 8B]
        cmp     rax, OFFSET FLAT:Base::dec()      # _7,
        je      .L12        #,
.L6:
        mov     rdi, rbx  #, base
        pop     rbx       #
        jmp     rax       # _7

Is this really advantageous, and why? It looks like the same number of memory lookups but with an additional comparison thrown in the mix. Perhaps my example is too trivial.

See godbolt for a version you can play with.

asked Jun 17 '16 by JCx



2 Answers

As you already figured out, gcc is trying to optimistically inline the functions. That is, it inlines their bodies after checking that the functions are not in fact overridden: it does this by comparing the value of the function pointers in the object's vtable[1]. This type of optimization may be referred to as speculative devirtualization and is widely applied in JIT-compiled languages such as C# and Java, while being more difficult, less profitable and less often applied in ahead-of-time compiled languages such as C++. Here we use speculative to distinguish it from the variant where the compiler can prove that the devirtualization is valid. In the optimistic case we instead hope that it is valid, and double-check at runtime.
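
To make the transformation concrete, here is a rough source-level sketch (my own, not gcc's output). Portable C++ cannot read a vtable slot directly, so this uses a typeid comparison, i.e. the coarser, Java-style check discussed in footnote [1], in place of gcc's function-pointer comparison; Base is the class from the question:

#include <typeinfo>

// Sketch of speculative devirtualization at the source level. The typeid test
// stands in for gcc's cheaper vtable-slot comparison; when it passes, the
// qualified calls bypass the vtable and can be inlined to i++ / i--.
void function_devirt(Base *base)
{
  if (typeid(*base) == typeid(Base)) {
    base->Base::ant();   // non-virtual, inlinable
    base->Base::dec();
  } else {
    base->ant();         // ordinary virtual dispatch
    base->dec();
  }
}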

This whole topic could fill a book (well, at least a paper or two), so if you want all the gory details on how the various types of devirtualization are implemented, take a look at Honza's 7-part series on devirtualization. He helped implement it in gcc[4], and a lot of the content there applies directly to this question. Here I'll focus on some specifics of your example.

Overall, it's a pretty interesting optimization that could pay big dividends in certain cases. In your specific case, however, it seems a bit questionable. First, note that it falls into a class of probabilistic optimizations: optimizations whose benefit or penalty depends on the runtime behavior of the application, in particular, whether the object passed as the base argument actually overrides the ant() and dec() methods. Whether the optimization is profitable will swing from "strictly a bad idea" to "looks somewhat OK" depending on the actual frequencies.

For example, if the actual behavior of the application is to always pass a base with the default ant() and dec() methods shown, the benefits are:

  1. A somewhat lower latency code path. Ignoring common work (such as loading the function pointers from the vtable, and the actual increment), the inlined approach basically adds two cmp + jne pairs, but saves one indirect call, one indirect jmp and two rets. Just counting instructions, that is likely to be a wash, but in practice the two macro-fused cmp/jne pairs are going to be very cheap, and also "out of line" with respect to the increment operations, so their latency cost is probably completely hidden. The indirect jumps and the rets are less likely to be hidden: modern x86 chips like Haswell are still good at executing these quickly, but there are hard limits such as one taken branch per cycle.

    That said, in this case, the difference between the two paths is likely to be fairly small. The two RMW inc and dec operations are likely to take longer than the jumping around.

  2. The branch prediction behavior is likely to be better, most notably when this code is cold. When cold, there is no info in any of the predictors about the likelihood of branches being taken, or about their targets (in the case of indirect branches). The two jne branches in the inlined case will likely get the default not-taken prediction and thus be predicted correctly[2], and the indirect branches will be avoided entirely. The always-virtual-call approach, however, has no chance of correctly predicting the branch targets, so you are going to suffer two consecutive branch mispredictions.

On the other hand, when the ant() and dec() methods are always overridden, the optimization is a dead loss. You have added extra comparisons and taken jumps to the execution path, and you still have to make the indirect calls anyway. Furthermore, you have bloated the code size considerably (47 bytes for the gcc version vs. 13 bytes for the clang one), relatively speaking. If your runtime code footprint is large, this will really hurt you in terms of I$ misses.

Peter mentions that even when the checks fail, the optimization at least reduces the number of targets (by one) for the subsequent indirect branch. However, you still need to include the mispredicts for the cmp/jne check in the analysis. When you do that, it seems like the check-first approach is always either tied or worse, at least in terms of expected mispredicts.

A couple of quick examples:

Two targets (x, y) for ant() with probabilities p(x) and p(y).

Assume p(x) > p(y) without loss of generality[3].

Check first:

jne predicts x always, so expected jne misses: p(y)
indirect call predicts y always, with zero expected misses
total expected misses = p(y)

Virtual call only:

BTB predicts x
expected misses: p(y)

So both cases are exactly the same, with p(y) expected misses.

Three targets, with probabilities x, y, z (dropping the p(x) notation from here on).

Assume for simplicity that y == z. You can work through the case where y != z; it doesn't change the qualitative conclusions.

Case x > y + z

Check first:

jne predicts the x (inlined) path always, so expected jne misses: y + z = 2y
indirect call predicts y, with z (== y) expected misses
misses = x*(0 + 0) + y*(1 + 0) + z*(1 + 1)
       = y + 2z = 3y

Virtual call only:

BTB predicts x
misses = x*0 + y + z = 2y

So in this case, check-first is dominated by virtual-only in mispredict chance, suffering 50% more mispredicts (proportional to the probability of the two less common targets). At the boundaries, when p(y) == p(z) == 0, it ties (but that means there aren't really three targets), and when p(x) == p(y) + p(z) it suffers an expected 0.75 mispredicts per call compared to 0.5 for the virtual-call-only approach.

Let's check the x < y + z case.

Check first:

jne predicts the y/z (indirect) path always, so expected jne misses: x
indirect call predicts y always, with z expected misses
misses = x*(1 + 0) + y*(0 + 0) + z*(0 + 1)
       = x + z

Virtual call only:

BTB predicts x
expected misses: y + z

So here again the virtual-call-only approach dominates check-first. The latter suffers p(x) - p(y) additional misses. The worst case seems to be around p(x) == ~0.49 and p(y) == p(z) == ~0.25, where check-first suffers ~0.25 additional misses per call. The other boundary condition is again a tie when p(z) == 0 (expected, since that's the two-target case).
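
If you want to plug in other probabilities, the arithmetic above boils down to a couple of one-liners. Here is a small self-contained sketch of my own, using the same "always predict the most likely outcome" model as the analysis (it is a model, not a claim about real predictors):

#include <algorithm>
#include <cstdio>

// Check-first: the jne learns the more likely side (inlined vs. indirect),
// and the indirect call then predicts the more common of the two remaining
// targets y and z.
double check_first_misses(double x, double y, double z)
{
  return std::min(x, y + z) + std::min(y, z);
}

// Virtual-call only: the BTB predicts the single most likely target.
double virtual_only_misses(double x, double y, double z)
{
  return (x + y + z) - std::max({x, y, z});
}

int main()
{
  // Boundary case from the text: p(x) ~= p(y) + p(z), with y == z.
  double x = 0.5, y = 0.25, z = 0.25;
  std::printf("check-first: %.2f, virtual-only: %.2f\n",
              check_first_misses(x, y, z), virtual_only_misses(x, y, z));
  // Prints 0.75 vs 0.50, matching the boundary numbers quoted above.
}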

Now all of the above assume that taken vs not-taken branch mispredicts are equal in cost to branch targets mispredicts. On modern x86 CPUs the cost does in fact seem to be about the same. Each type of mispredict causes a full redirect and a penalty of 15-20 cycles. There are still second order effects present - for example, an indirect branch may be resolved later than a conditional direct one if the jump address takes longer to calculate than the branch condition. That doesn't seem to be the case here since both the branch decision and the branch address depend on the same thing: the address of the ant() function in the vtable.

The above also assumes that indirect branches are as well predicted as conditional branches, and that this prediction consumes an equal amount of resources. This is, in general, not true. Processors generally have fewer resources dedicated to indirect BTB entries, and even given equal resources, reaching a given prediction rate requires more resources in the indirect-BTB case than in the conditional-branch case, since IBTB state (a target address) is larger than a single bit of taken/not-taken info. Smaller or older CPUs may not have any indirect branch prediction capability at all, but the difference is real even in modern x86 CPUs. It is hard to quantify, but in the limit, when this function is called a lot, the difference disappears (since there are then enough resources to track the hottest indirect targets accurately).

If you got this far, you might well feel that, overall, this seems like a questionable optimization in this specific case. The potential upside is fairly small, and the cost in terms of code bloat is large. Furthermore, in the intermediate cases (where base->ant() only sometimes resolves to Base::ant), the proposition is questionable since the increased mispredicts eat into the benefit of inlining the call.

On the face of it, I would tend to agree - but there are a couple of mitigating factors:

First, gcc is actually trying to be smart about when it applies this optimization: it only does so when it can't see that the methods in question are actually overridden. Take a look at this small modification of your example, sketched below. The function is unchanged, but we add an (unused in this file) subclass of Base which does override the methods. Now gcc no longer does the speculative inlining: it sees that the methods are overridden, so it doesn't consider the inlining worth it.
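
A sketch of the kind of modification described (the class name is mine; any same-file override of the methods has the same effect):

// Adding this otherwise-unused subclass to the same translation unit as
// function() is enough to make gcc drop the speculative inlining: it can now
// see that ant() and dec() really are overridden somewhere.
class Derived : public Base
{
  public:
    void ant() override { i += 2; }
    void dec() override { i -= 2; }
};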

This whole idea of what gcc can see is pretty important. Usually gcc looks only at a single compilation unit, which greatly restricts its visibility. You could reasonably argue that, in general, Base would be overridden in a separate compilation unit if it is overridden at all, so the behavior noted above (gcc deciding whether to apply the speculative inlining based on whether an override exists) isn't too useful, since same-file overrides are rare.

On the other hand, you might note that godbolt is putting your class definition in a .cpp file[5]. It is very rare (and bad practice) to include a .cpp file in another compilation unit, so the guess that Base isn't overridden is perhaps a pretty good one in this case. Of course, I don't think gcc actually uses that info: by the time the actual compiler sees the file, the distinction between header and .cpp files has pretty much been lost.

Whether a class is likely to be overridden, when your scope is a single compilation unit, is nearly as much a philosophical question as a technical one. This is exactly the kind of question that LTO and PGO (as Peter mentioned) are supposed to solve. In the LTO case, by deferring optimization to link time, the entire set of statically available[6] classes and method overrides is available for examination. In the case of PGO, the compiler can use information about which classes actually appear as targets at each call site, and optimize based on the observed overrides. It can even make different decisions for different call sites (assuming function() itself can be inlined). PGO can approach the quality of devirtualization that is usually reserved for JIT-compiled languages.

In fact, this topic is important enough that Jan gives it an entire entry in his series on devirtualization. Of particular importance, he covers cases where the compiler can be sure that there are no subclasses or method overrides, and hence the devirtualization is no longer speculative: classes in anonymous namespaces, local classes, and final methods and classes.
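
As a quick illustration of that last point (the names are mine, not from the series): with final, the compiler can prove that no override exists, so the calls need no runtime check at all.

// With 'final' there can be no subclass and hence no override, so the
// compiler may devirtualize (and inline) these calls non-speculatively.
class SealedBase final
{
  public:
    virtual void ant() { i++; }
    virtual void dec() { i--; }
    int i = 0;
};

void use(SealedBase *p)
{
    p->ant();   // resolvable at compile time: no vtable load, no check
    p->dec();
}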

A final note is in defense of gcc's decision to speculatively inline this at all. The example given is more or less at one end of the risk-vs-reward spectrum for this optimization. As it turns out, the functions involved take only a single instruction to implement (aside from the required ret, one of which is even optimized away by the tail call). So the benefit of inlining here is very small: inlining is mostly valuable as a "granddaddy" optimization that enables lots of other optimizations to work, and it often eliminates a large part of the cost of the inlined function. Because the functions are so small, with zero prologue and epilogue, and because the compiler can't easily optimize across the ant() and dec() calls (they end up in different basic blocks), inlining doesn't help very much here.

Here's another example which gives gcc more of a chance to optimize:

class Base
{
  public:
    virtual int add5(int x) { return 5 + x; };
};

int function(Base * base, int x) 
{ 
  return base->add5(x) - base->add5(x);
}

int add5_static(int x) {
  return 5 + x;
}

int function2(int x) {
  return add5_static(x) - add5_static(x);
}

int function3(int x) {
  Base b = Base();
  return b.add5(x) - b.add5(x);
}

Here, you call the same function more than once. This could allow gcc to optimize the function pointer checks (you only need one for add5, not two as for ant and dec). It could also allow gcc to optimize between the two function calls, replacing (5 + x) - (5 + x) with something as simple as 0.

Let's see what gcc does with that, via godbolt! Looks good initially. The inlined version of the call only requires one cmp/jne since the same function is called twice. Here's the optimized version, starting with the function pointer check:

    cmp     rax, OFFSET FLAT:Base::add5(int)  # _4,
    jne     .L3       #,
    add     esi, 5    # _11,
    mov     ebp, esi  # _7, _11
    mov     eax, ebp  # _7, _7
    pop     rbx       #
    sub     eax, esi  # _7, _11
    pop     rbp       #
    pop     r12       #
    ret

It's a mixed bag. There are 7 instructions after the jump, and a lot of redundancy. First, note that gcc is able to do some inter-call optimization: the single add esi, 5 shows that the common sub-expression (the 5 + x part of the two identical calls) was optimized into a single add. You then get eax = ebp = esi. The assignment to ebp is redundant; it is never used again in this function. Then you get sub eax, esi. This is totally redundant, because eax == esi, and so the result is always zero.

Even the pop instructions are all redundant. We push those three registers at the top of the function, then pop them off again in the optimized path. It is legitimate to push these registers before we make the virtual calls, but that can all be deferred until after the check fails. So a total of six push and pop instructions are unnecessary in the optimized path.

All that is to say that the optimal implementation of these 7 instructions is simply xor eax, eax, a near-free instruction (on top of avoiding the three earlier push instructions, not shown, which could also be avoided[7]). Gcc missed all the easy optimizations, and the optimized fast path is perhaps an order of magnitude slower than it needs to be. The reality is that even though these optimizations are "obvious", everything occurs in stages: the stages that could remove the redundant moves, the subtraction and the pushes and pops probably run before the devirtualization phase, and later on it is too late for this redundancy to be removed.
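
For comparison, here is what the fast path could look like if written by hand at the source level (a sketch of mine, again with a typeid test standing in for gcc's function-pointer check):

#include <typeinfo>

// Once the check passes, (5 + x) - (5 + x) folds to 0, so the body after the
// check reduces to returning 0 (xor eax, eax) with no pushes, pops or sub.
int function_ideal(Base *base, int x)
{
  if (typeid(*base) == typeid(Base))
    return 0;
  return base->add5(x) - base->add5(x);   // slow path: real virtual calls
}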

Just to show that gcc is indeed capable of optimizing this better, take a look at function2:

int add5_static(int x) {
  return 5 + x;
}

int function2(int x) {
  return add5_static(x) - add5_static(x);
}

This is the same as function, except without the complication of possibly virtual function calls. The entire function is simply:

function2(int):
    xor     eax, eax  #
    ret 

So everything cancels out into return 0;. The compiler could do the same thing with the speculatively devirtualized calls to add5, but it fails to.

Oddly enough, the non-speculative devirtualization does just fine:

int function3(int x) {
  Base b = Base();
  return b.add5(x) - b.add5(x);
}

reduces to exactly the same thing:

function3(int):
    xor     eax, eax  #
    ret

So something about speculative devirtualization seems to block many of the optimizations that would otherwise greatly simplify the code and lead to some big wins. Even without those, the devirtualized fast path is still likely to be noticeably faster than the version without devirtualization.


[1] It is worth noting that this is a more precise check than what typically occurs when devirtualizing JITed languages. In Java, for example, the check is simply against the object type (i.e., the value of the vtable pointer), rather than against the particular function implementation (i.e., the value of the function pointer in the vtable). The function-specific check is more costly, since it involves dereferencing the vtable pointer, but it "works" in many more cases: it would work for any subclass of Base that didn't actually override ant() and/or dec(), while the class type check would fail completely.

The difference is probably down to when the optimization is applied: because the JITed code knows the most common class(es) at the call site based on profiling, even the coarser (and cheaper) check is going to be effective, since it uses the most common observed type. The C++ code doesn't have this benefit (unless LTO and PGO are used), so the more careful check probably pays off.

[2] In fact, the actual behavior is complex and depends specifically on the exact processor version. In older processors, the branch predictors were actually much more likely to use the default predictor for cold branches (which predicts not-taken for forward branches), since the prediction mechanisms were tied tightly to the IP of the branch.

In more recent predictors (Haswell and newer), the mechanism is more complex, with a hash of several factors being used, so you may often "collide" with other existing state even in the cold case, and behavior is not as predictable any more. I think it's safe to say you won't usually get more than a 50% mispredict rate for forward, not-taken branches, however. You can find some investigation on this blog and discussion here.

[3] OK, you caught me. It excludes the case p(x) == p(y), but you can work that through too and it doesn't change anything.

[4] A binary search on godbolt, as well as Honza's blog, confirms that this was added in gcc 4.9. Prior to that version, the code is compiled in the same way as by clang and icc.

[5] In particular, the #pragma message shows that the source is all compiled in a single file called example.cpp.

[6] By statically available I mean the set of LTO-enabled files that are available to gcc when the LTO phase occurs. This excludes at least two important categories of code: (1) code in static or dynamic objects which are linked into the final binary but for which LTO information is not available (pretty much everything you link that isn't part of your compile process), and (2) additional code loaded at runtime, e.g., via dlopen, which LTO can't possibly account for.

[7] Peter Cordes points out that this "push everything up front" behavior, where everything that needs to be saved on any possible path through the function is pushed immediately on entry, seems to be a pattern in gcc. For functions with very short fast paths, this seems like a big limitation.

answered Sep 28 '22 by BeeOnRope


gcc assumes that two predictable branches are cheaper than two indirect calls. That's certainly true if it turns out that it guessed right and the calls it inlined are the ones that should happen. Profile-guided optimization (-fprofile-generate / -fprofile-use) will probably detect if this is not the case, and speculatively inline something else if appropriate. (Or not, maybe it's not that smart).

Indirect branches are potentially quite expensive, since the branch predictor has to correctly predict the target address. Two not-taken branches are very cheap if they're predictable. I'm a bit surprised that gcc doesn't check that both addresses are the base-class versions and skip touching i entirely in that case (since the increment and decrement cancel out).
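
A source-level sketch of that idea (mine, with a typeid test standing in for the two function-pointer comparisons):

#include <typeinfo>

// If both ant() and dec() are the Base versions, the i++ and i-- cancel, so
// i need not be touched at all; otherwise fall back to the two virtual calls.
void function_shortcut(Base *base)
{
  if (typeid(*base) == typeid(Base))
    return;
  base->ant();
  base->dec();
}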


Modern CPUs are quite good at blowing through a lot of instructions quickly, when there's a decent amount of instruction-level parallelism like in this case. See Agner Fog's microarch guide, and other links in the x86 tag wiki.

answered Sep 28 '22 by Peter Cordes