For the div/mod part of the following code:
```cpp
int pow(int x, unsigned int n)
{
    int y = 1;
    while (n > 1)
    {
        auto m = n % 2;
        n = n / 2;
        if (m)
            y *= x;
        x = x * x;
    }
    return x * y;
}
```
I'd expect assembly like

```asm
shr n
cmovc y, yx
```

But gcc/clang and even icc don't use the carry flag here, instead using two registers and `and`/`test`: https://godbolt.org/z/L6VUZ1
So I wonder what the best way to hand-code it would be, and why (ILP, dependencies, ...).
`test/je` can macro-fuse into a single uop on mainstream Intel and AMD CPUs, so in branchy code you'd only be saving code size, and costing 1 cycle of branch-miss-detection latency, by using the CF output of a shift instead of an earlier `test/je`.
(Unfortunately gcc is really dumb here and uses `test edx,edx` on the result of `and edx,1`, instead of just using `test dl,1`, or better `test sil,1` to save the `mov` as well. `test esi,1` would use the imm32 encoding, because there's no `test r/m32, imm8` encoding for `test`, so compilers know to read narrow registers for `test`.)
But in a branchless implementation like clang uses, yes, you'd save a uop by using the CF output for `cmovc` instead of separately computing an input for `cmove` with `test`. You still don't shorten the critical path, because `test` and `shr` can run in parallel, and mainstream CPUs like Haswell or Ryzen have pipelines wide enough to fully exploit all the ILP and just bottleneck on the `imul` loop-carried dependency chain (https://agner.org/optimize/).
Actually it's the `cmov` -> `imul` -> next-iteration dep chain for `y` that's the bottleneck. On Haswell and earlier, `cmov` is 2-cycle latency (2 uops), so the total dep chain is 2 + 3 = 5 cycles. (Pipelined multipliers mean that doing an extra `y *= 1` multiply isn't slowing down the `x *= x` part or vice versa; they can both be in flight at once, just not start in the same cycle.)
If you're using the same `n` repeatedly for different bases, a branchy version should predict well and be very good, because branch prediction + speculative execution decouples data dependency chains. Otherwise it's probably better to eat the longer latency of a branchless version than to suffer branch misses.