
Using carry flag after shift operation

Tags:

intel

For the div/mod part of the following code:

int pow(int x, unsigned int n)
{
    int y = 1;
    while (n > 1)
    {
        auto m = n%2;
        n = n/2;
        if (m)
            y *= x;
        x = x*x;
    }
    return x*y;
}

I'd expect assembly like

shr n
cmovc y, yx

But gcc/clang and even icc don't use the carry flag here (using 2 registers and and/test instead): https://godbolt.org/z/L6VUZ1

So I wonder what the best way would be if you hand-coded it and why (ILP, dependencies,...).

asked Oct 18 '18 by Trass3r


1 Answer

test/je can macro-fuse into a single uop on mainstream Intel and AMD CPUs, so in branchy code you'd only be saving code-size and costing 1 cycle of branch-miss-detection latency by using the CF output of a shift instead of an earlier test/je.
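
For concreteness, the branchy inner loop being compared could look something like this (Intel syntax; just a sketch of the loop body with hypothetical register assignments of n in esi, x in edi, y in eax, and assuming the n > 1 entry check has already happened before falling into the loop):

.loop:
    shr     esi, 1              ; n >>= 1, CF = old low bit of n
    jnc     .skip               ; shifts can't macro-fuse with a following jcc
    imul    eax, edi            ; y *= x, only when that bit was set
.skip:
    imul    edi, edi            ; x *= x
    cmp     esi, 1
    ja      .loop               ; keep looping while n > 1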

(Unfortunately gcc is really dumb here and uses test edx,edx on the result of and edx,1, instead of just test dl,1, or better test sil,1 to save the mov as well. test esi,1 would have to use the imm32 encoding, because there's no test r/m32, imm8 encoding for test, which is why compilers know to read a narrow register for test when only the low bits matter.)
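
For reference, the encoding sizes in question (64-bit mode; shown only to illustrate the code-size point):

    and  edx, 1            ; 3 bytes
    test edx, edx          ; 2 bytes  (what gcc emits, on the result of the and)
    test dl, 1             ; 3 bytes  (the imm8 form exists for byte registers)
    test sil, 1            ; 4 bytes  (needs a REX prefix to address sil)
    test esi, 1            ; 6 bytes  (forced to the imm32 encoding)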

But in a branchless implementation like the one clang uses, yes, you'd save a uop by using the CF output of the shift for cmovc instead of using a separate test to compute a flag input for cmove. It still doesn't shorten the critical path, because test and shr can run in parallel, and mainstream CPUs like Haswell or Ryzen have wide enough pipelines to fully exploit all the ILP and just bottleneck on the imul loop-carried dependency chain (https://agner.org/optimize/).
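
A hand-written branchless version along those lines might look like the following (Intel syntax; a sketch assuming the SysV x86-64 convention with x in edi, n in esi and the result in eax; register choices are mine, not taken from the compiler output):

pow:
    mov     eax, 1              ; y = 1
.loop:
    cmp     esi, 1
    jbe     .done               ; while (n > 1)
    mov     edx, eax
    imul    edx, edi            ; edx = y*x, computed unconditionally (discarded if the bit is 0)
    shr     esi, 1              ; n >>= 1, CF = old low bit of n
    cmovc   eax, edx            ; keep y*x only if that bit was set
    imul    edi, edi            ; x *= x
    jmp     .loop
.done:
    imul    eax, edi            ; return x*y
    ret

The shr/cmovc pair replaces the test-fed cmove and saves that uop, but the imul -> cmovc loop-carried chain on eax is still the latency bottleneck described next.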

Actually it's the cmov -> imul -> next-iteration dep chain for y that's the bottleneck. On Haswell and earlier, cmov is 2-cycle latency (2 uops), so the total dep chain is 2+3 = 5 cycles. (Pipelined multipliers mean that doing an extra y*=1 multiply isn't slowing down the x*=x part or vice versa; they can both be in flight at once, just not start in the same cycle.)

If you're using the same n repeatedly for different bases, a branchy version should predict well, and be very good because branch-prediction + speculative execution decouples data dependency chains.

Otherwise it's probably better to eat the longer latency of a branchless version instead of suffering branch misses.

answered Nov 15 '22 by Peter Cordes