I want to do some long integer math (128-bit) in Intel x64 assembly and need to take a 2's complement. Let's say that my positive value is in RDX:RAX.
2's complement is done by "flip the bits and add 1". So the most naive implementation is (4 instructions and 14 bytes of code):
NOT RAX
NOT RDX
ADD RAX,1 ; Can't use INC, it doesn't set Carry
ADC RDX,0
When I use the NEG instruction on RAX instead of NOT, it does the "+1" for me, but the Carry is wrong: NEG RAX clears Carry when RAX was zero, yet that is exactly the case where I need Carry set. So the next best approach might be (4 instructions and 11 bytes of code):
NOT RDX
NEG RAX
CMC
ADC RDX,0 ; fixed, thanks lurker
Still 4 instructions. But instead of adding +1, I can subtract -1, and since SBB adds the Carry bit to the subtrahend, it will add +1 only when Carry is clear. So my next best try is this, with 3 instructions and 10 bytes of code:
NOT RDX
NEG RAX
SBB RDX,-1
As you can see from my long-winded text, this is not obvious to understand. Is there a better, more understandable way to do a cascaded 2's complement in assembler?
To convert a negative signed integer to 2's complement:
1. Fix the number of bits. Write zeros in the extra places to the left.
2. Replace each bit by 1 - bit. Notice this amounts to flipping the 0's and 1's.
3. Add 1, ignoring any carry out of the leftmost bit.
The 2's complement representation of a positive number is the same as its ordinary binary representation, with leading zeros affixed.
In the 8-bit two's complement representation of an integer value between -127 and 127, positive integers are represented in the normal binary manner (with a natural leading zero, since the largest value is 127), and the representations of negative numbers are obtained from those of the corresponding positive numbers as above.
For example, find the 8-bit 2's complement representation of -45. First convert 45 to binary: 45 (decimal) = 101101 (binary). Pad with zeros on the left to make eight bits: 00101101. Flip the 0's and 1's: 11010010. Add 1: 11010010 + 1 = 11010011.
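For illustration, here is how those three steps look in assembly, as a minimal sketch using the 8-bit register AL (the register choice is mine, not from the text above):
mov al, 45     ; 00101101 = +45, already eight bits
not al         ; 11010010  (replace each bit by 1 - bit)
add al, 1      ; 11010011  = -45 in 8-bit 2's complement
(NEG AL would perform the last two steps in one instruction.)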
Shorter instructions or fewer instructions don't necessarily mean faster execution, because the latency and throughput of each instruction are different.
For example, obsolete instructions like enter, daa, or loop will be very slow; they're there only for backward compatibility. Even inc is sometimes slower than add. The same goes for the cmc you used above on some μarchs.
As a result, a longer series of low-latency instructions that can execute in parallel will often run much faster. Some common groups of instructions can even be fused together into a single macro-op. The compilers' optimizers know this and will choose the most appropriate instructions to emit.
For this snippet
__int128 negate(__int128 x)
{
return -x;
}
ICC 19.0.1 generates the following instructions:
xor edx, edx #3.13
xor eax, eax #3.13
sub rax, rdi #3.13
sbb rdx, rsi #3.13
The first two xor instructions cost zero μops, because they're handled at the register-rename stage. Now you have only 2 instructions to execute.
You can switch the compiler in the Godbolt link above to see the various ways different compilers negate, including MSVC (unfortunately it doesn't have a 128-bit integer type yet). Below are the results for GCC and Clang.
GCC 8.3:
mov rax, rdi
neg rax
mov rdx, rsi
adc rdx, 0
neg rdx
Clang:
mov rax, rdi
xor edx, edx
neg rax
sbb rdx, rsi
As you can see, Clang also uses only 3 instructions (not counting the first one, which moves data from the input argument to the required destination). But like xor reg, reg, a mov can also be "free".
Things may be different if you optimize for space (such as in cases where cache misses are high), because some immediates and instructions are long.
Whether it's faster or not needs some micro-benchmarking. But on Intel CPUs, the Intel compiler (ICC) often achieves higher performance than others because it understands the architecture better.
Note that the operation is called negation, not two's complement; two's complement is a way to encode negative numbers.
BTW, negating a 2-register number is the same in 32-bit or 16-bit mode, with EDX:EAX or DX:AX: use the same instruction sequences, as sketched below.
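For instance, the question's 3-instruction sequence carries over to 32-bit mode unchanged apart from the register names (a direct transcription, as a sketch):
not edx        ; one's complement of the high half
neg eax        ; negate the low half, CF = (EAX != 0)
sbb edx, -1    ; add 1 to the high half only when EAX was zero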
To copy-and-negate, @phuclv's answer shows efficient compiler output. The best bet is xor-zeroing the destination and then using sub / sbb.
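That is, the sequence ICC emitted above, repeated here with explanatory comments:
xor edx, edx   ; zero the destination high half (dependency-breaking idiom)
xor eax, eax   ; zero the destination low half
sub rax, rdi   ; low half: 0 - RDI, sets CF when RDI != 0
sbb rdx, rsi   ; high half: 0 - RSI - CF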
That's 4 uops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 uops. The xor-zeroing is off the critical path (it can happen before the data to be negated is ready), so the total latency is 2 or 3 cycles for the high half. The low half is of course ready with 1-cycle latency.
Clang's mov/neg for the low half is maybe better on Ryzen, which has mov-elimination for GP integer registers but still needs an ALU execution unit for xor-zeroing. On older CPUs, though, it puts a mov on the critical path for latency. Still, back-end ALU pressure is usually not as big a deal as front-end bottlenecks, for instructions that can use any ALU port.
To negate in-place, use neg to subtract from 0:
neg rdx ; high half first
neg rax ; subtract RDX:RAX from 0
sbb rdx, 0 ; with carry from low to high half
neg is exactly equivalent to sub from 0, as far as setting flags and performance.
ADC/SBB with an immediate 0 is only 1 uop on Intel SnB/IvB/Haswell, as a special case. It's still 2 uops on Nehalem and earlier, though. But without mov-elimination, a mov to another register and then sbb back into RDX would be slower.
The low half (in RAX) is ready in the first cycle after it's ready as an input to neg. (So out-of-order execution of later code can get started using the low half.)
The high half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait for RDX from neg rdx and for CF from neg rax, so it's ready at the later of 1 cycle after the low half, or 2 cycles after the input high half is ready.
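Illustratively, assuming all inputs are ready at cycle 0 and 1-cycle neg/sbb latencies:
neg rdx      ; starts at cycle 0, independent of the low half
neg rax      ; starts at cycle 0; RAX ready at cycle 1, CF = (RAX != 0)
sbb rdx, 0   ; waits on RDX and CF; runs at cycle 1, high half ready at cycle 2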
The above sequence is better than any in the question, being fewer uops on very common Intel CPUs. On Broadwell and later (where SBB is single-uop, not just with an immediate 0), the 3-instruction sequence from the question is equally good:
;; equally good on Broadwell/Skylake, and AMD. But worse on Intel SnB through HSW
NOT RDX
NEG RAX
SBB RDX,-1 ; can't use the imm=0 special case
All of the 4-instruction sequences are obviously sub-optimal, being more total uops. And some of them have worse ILP / dependency chains / latency, like 2 instructions on the critical path for the low half, or a 3-cycle chain for the high half.