 

Performance of duplicating vs deduplicating identical conditional code in loops

Say I have code like this, call it version 1:

while (some_condition) {
    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        // .. rare work here ..
        continue;
    }

    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        // .. rare work here ..
        continue;
    }

    // .. more work
}

Assume that the "rare work" is identical across both cases. We could equivalently write version 2:

while (some_condition) {
    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        goto rare_work;
    }

    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        goto rare_work;
    }

    // .. more work
    continue;

  rare_work:
    // .. rare work here ..
}
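For concreteness, here is one way version 2 might be instantiated as real, compilable C. The specific work and rare-condition bodies below are made-up placeholders, chosen only to exercise the control flow:

```c
/* Hypothetical instantiation of version 2: count "rare" elements.
 * The work/rare bodies are placeholders; only the control flow matters. */
static int count_rare(const int *vals, int n)
{
    int i = 0;
    int rares = 0;

    while (i < n)               /* some_condition */
    {
        int v = vals[i++];      /* work that may trigger rare_condition */
        if (v < 0)
            goto rare_work;

        v *= 2;                 /* more work that may trigger rare_condition */
        if (v > 100)
            goto rare_work;

        continue;               /* .. more work would go here .. */

    rare_work:
        rares++;                /* rare work, shared by both checks */
    }
    return rares;
}
```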

The compiler should usually be smart enough to make the if() checks jump directly to rare_work, rather than jumping to a block that itself jumps to rare_work. The compiler may even transform version 1 into version 2 for us (GCC calls this optimization cross-jumping, or tail merging), because it reduces the overall amount of assembly, increasing the chance of fitting in the instruction cache.

My question is: if instruction cache is not an issue, is there any reason to expect one or the other to perform better? Assume we care about low-level micro-optimization and are willing to write the code in assembly if we have to, in order to force one version or the other. I realize that things like port availability could change based on what the work instructions actually are; I'm just wondering whether there is any high-level reason to expect a difference before doing such an analysis.

Asked Sep 02 '25 by Joseph Garvin

1 Answer

My question is: if instruction cache is not an issue, is there any reason to expect one or the other to perform better?

The performance concern here is the number of branches: a branch prediction miss is where you lose performance, and the best countermeasure is simply to reduce the number of branches. Since your code is pseudocode, there's no telling whether that is possible here, but that is what manual optimization should focus on, rather than code re-ordering.
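As a hypothetical illustration of what "reducing branches" can mean (this is not the question's code): a data-dependent if can sometimes be replaced by branchless arithmetic, so there is nothing left to mispredict:

```c
/* Clamp negative values to zero: once with a branch, once without. */
static int clamp_branchy(int x)
{
    if (x < 0)          /* a branch the predictor has to guess */
        return 0;
    return x;
}

static int clamp_branchless(int x)
{
    /* (x > 0) evaluates to 0 or 1; compilers typically turn this into
     * a setcc plus a multiply (or a cmov), with no conditional jump. */
    return x * (x > 0);
}
```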


From a pure readability point of view, it is perfectly possible to rewrite the loop without any "spaghetti programming" (continue/goto) and still keep the same performance:

while (some_condition)
{
    bool is_rare = work1();
    if (is_rare)
    {
        rare();
    }
    else
    {
        is_rare = work2();
        if (is_rare)
        {
            rare();
        }
    }
}

All functions here are static inline. Trying this out with gcc x86 -O3, I got assembly pretty much identical to your goto version, with rare() inlined at the bottom of the loop, which at least proves that we don't have to write spaghetti at the C level.

Artificial example code: https://godbolt.org/z/M8EznYr1z. .L5: is the rare() function inlined.

do_stuff:
        sub     rsp, 24
.L12:
        movzx   eax, BYTE PTR [rsp+15]
        test    al, al
        je      .L1
.L14:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        cmp     BYTE PTR rare_[rip], 0
        jne     .L5
        mov     edi, OFFSET FLAT:.LC2
        call    puts
        cmp     BYTE PTR rare_[rip], 0
        je      .L12
.L5:
        mov     edi, OFFSET FLAT:.LC1
        call    puts
        movzx   eax, BYTE PTR [rsp+15]
        test    al, al
        jne     .L14
.L1:
        add     rsp, 24
        ret
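For reference, the C source behind output like that presumably looked something like the sketch below. The names (work1, work2, rare, rare_) are assumptions matching the structure above, not the actual code from the Godbolt link; the puts calls from the assembly are omitted, and the deterministic rare_ toggling is invented so the example terminates and can be checked:

```c
#include <stdbool.h>

/* Assumed shape of the source: work1/work2 do some work and report
 * whether the rare condition hit; rare() handles it. Everything is
 * static inline so gcc -O3 can fold it all into one loop body. */
static bool rare_;

static inline bool work1(void) { return rare_; }
static inline bool work2(void) { rare_ = true; return true; } /* invented: always trip the rare path */
static inline void rare(void)  { rare_ = false; }

static int do_stuff(int iterations)
{
    int rare_hits = 0;
    while (iterations-- > 0)
    {
        bool is_rare = work1();
        if (is_rare)
        {
            rare();
            rare_hits++;
        }
        else
        {
            is_rare = work2();
            if (is_rare)
            {
                rare();
                rare_hits++;
            }
        }
    }
    return rare_hits;
}
```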
Answered Sep 05 '25 by Lundin