In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC
but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c
is a workaround to get my hands on the carry flag and add it to b
and X
. It looks I got luck and the (g++ -O3
, version 9.1) generated code is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X
that I've tested the code is as above (except, of course for the immediate value $0x0
that changes accordingly). I found one exception though: when X == -1
(or 0xFFFFFFFFu
or ~0u
, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code as suggested by indirect measurements (not very scientific though) Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?
For what is worth, clang -O3
, version 8.8.0, always uses ADC
(as I wanted) and icc -O3
, version 19.0.1 never does.
I've tried using the intrinsic _addcarry_u32
but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32
correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c
and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0
I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1
I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC
but this is clearly not the most efficient code. (What's dl
doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov
+ adc $-1, %eax
is more efficient than xor
-zero + setc
+ 3-component lea
for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.1
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc
pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc
proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc
function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b
idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc
where the middle adc
has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1
wraps around to 0, so sum = a+b+carry_in
/ carry_out = sum < a
can't optimize to an adc
because it needs to ignore carry in the special case where a = -1
and carry_in = 1
.
So another guess is that maybe gcc considered doing the + X
earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32
correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.
If you only want carry-out, you can provide a 0
as the carry in and it will optimize to add
instead of adc
, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcary_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128
where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add
/adc
/adc
/adc
. GCC makes a mess, using setcc
to store CF to an integer for some of the steps, then add dl, -1
to put it back into CF for an adc
.
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: or for uop count: equal on Intel Haswell and earlier where adc
is 2 uops, except with a zero immediate where Sandybridge-family's decoders special case that as 1 uop.
But the 3-component LEA with a base + index + disp
makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc
is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc
would still be a better choice.
https://agner.org/optimize/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With