 

Two's complement of long integer

I want to do some long-integer math (128-bit) in x86-64 (Intel 64) assembly and need to build the 2's complement. Let's say that my positive value is in RDX:RAX.

2's complement is done by "flip the bits and add 1". So the most naive implementation is (4 instructions and 14 bytes of code):

  NOT RAX
  NOT RDX
  ADD RAX,1   ; Can't use INC, it doesn't set Carry
  ADC RDX,0
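
In plain C terms, this is what the sequence computes on the two 64-bit halves (a minimal sketch; the function and variable names are only illustrative, with hi:lo standing in for RDX:RAX):

    #include <stdint.h>

    /* Sketch: flip-and-add-1 on a 128-bit value held as hi:lo.
       The carry into the high half occurs exactly when ~lo + 1
       wraps to zero, i.e. when lo was 0. */
    static void neg128_flip_add(uint64_t *hi, uint64_t *lo)
    {
        uint64_t nlo = ~*lo + 1;            /* NOT RAX; ADD RAX,1 */
        uint64_t nhi = ~*hi + (*lo == 0);   /* NOT RDX; ADC RDX,0 */
        *lo = nlo;
        *hi = nhi;
    }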

When I use the NEG instruction on RAX instead of NOT, it does the "+1" for me, but the Carry is wrong: NEG RAX clears Carry when RAX was zero, yet that is exactly the case where I need Carry set. So the next best approach might be (4 instructions and 11 bytes of code):

  NOT RDX
  NEG RAX
  CMC
  ADC RDX,0                  ; fixed, thanks lurker

Still 4 instructions. But instead of adding 1, I can subtract -1: since SBB adds the Carry bit to the subtrahend, the net effect is adding 1 only when Carry is clear. So my next best try is this, with 3 instructions and 10 bytes of code:

   NOT RDX
   NEG RAX
   SBB RDX,-1
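
To sanity-check this 3-instruction version against the compiler's own 128-bit negation, a minimal GCC/Clang test harness with inline asm could look like the sketch below (the function name and test values are only illustrative; it assumes an x86-64 target, e.g. built with gcc -O2):

    #include <stdint.h>
    #include <stdio.h>

    /* Apply the NOT/NEG/SBB sequence above to the 128-bit value hi:lo. */
    static void neg128_not_neg_sbb(uint64_t *hi, uint64_t *lo)
    {
        __asm__ ("not  %0\n\t"        /* NOT RDX    */
                 "neg  %1\n\t"        /* NEG RAX    */
                 "sbb  $-1, %0"       /* SBB RDX,-1 */
                 : "+r"(*hi), "+r"(*lo)
                 :
                 : "cc");
    }

    int main(void)
    {
        unsigned __int128 tests[] = {
            0, 1, (unsigned __int128)1 << 64,
            ~(unsigned __int128)0, 0x0123456789abcdefULL
        };
        for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
            uint64_t hi = (uint64_t)(tests[i] >> 64), lo = (uint64_t)tests[i];
            neg128_not_neg_sbb(&hi, &lo);
            unsigned __int128 want = -tests[i];   /* compiler's 128-bit negation */
            puts(hi == (uint64_t)(want >> 64) && lo == (uint64_t)want
                     ? "ok" : "MISMATCH");
        }
        return 0;
    }

Each test prints ok when the asm result matches the negation computed with unsigned __int128.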

As you can see from my long-winded text, this is not obvious to understand. Is there a better, more understandable way to do a cascaded 2's complement in assembler?

asked Apr 03 '15 by Rolf



2 Answers

Fewer or shorter instructions don't necessarily mean faster execution, because latency and throughput differ from instruction to instruction.

For example, obsolete instructions like enter, daa, loop... are very slow and exist only for backward compatibility. Even inc is sometimes slower than add, and the same goes for the cmc used above on some μarchs.

As a result a longer series of low latency instructions that can be executed in parallel will work much faster. Some common groups of instructions can even be fused together into a single macro-op. The compilers' optimizers always know this and will choose the most appropriate instructions to emit.

For this snippet

__int128 negate(__int128 x)
{
    return -x;
}

ICC 19.0.1 will generate the following instructions

    xor       edx, edx                                      #3.13
    xor       eax, eax                                      #3.13
    sub       rax, rdi                                      #3.13
    sbb       rdx, rsi                                      #3.13

The first two xor instructions cost zero μops because they're handled at the register-rename stage, so only 2 instructions are left to execute.

You can switch the compiler in the Godbolt link above to see the various ways different compilers negate, including MSVC (which unfortunately doesn't have a 128-bit type yet). Below are the results for GCC and Clang.

GCC 8.3:

    mov     rax, rdi
    neg     rax
    mov     rdx, rsi
    adc     rdx, 0
    neg     rdx

Clang:

    mov     rax, rdi
    xor     edx, edx
    neg     rax
    sbb     rdx, rsi

As you can see, Clang also uses only 3 instructions (not counting the first one, which moves the input argument into the required destination register). And like xor reg, reg, mov can also be "free".

Things may be different if you optimize for space (for example when cache misses are costly), because some immediates and instructions are long.

Whether it's actually faster needs some microbenchmarking. But on Intel CPUs, the Intel compiler (ICC) often achieves higher performance than others because it understands the architecture better.

Note that this operation is called negation, not two's complement; two's complement is the way negative numbers are encoded.

answered Sep 21 '22 by phuclv


BTW, negating a 2-register number is the same in 32-bit or 16-bit mode, with EDX:EAX or DX:AX. Use the same instruction sequences.


To copy-and-negate, @phuclv's answer shows efficient compiler output. The best bet is xor-zeroing the destination and then using sub / sbb.

That's 4 uops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 uops. The xor-zeroing is off the critical path (can happen before the data to be negated is ready), so this has a total latency of 2 or 3 cycles for the high half. The low half is of course ready with 1 cycle latency.

Clang's mov/neg for the low half is maybe better on Ryzen, which has mov-elimination for GP integers but still needs an ALU execution unit for xor-zeroing. For older CPUs, though, it puts a mov on the critical path for latency. Usually back-end ALU pressure is not as big a deal as front-end bottlenecks, for instructions that can run on any ALU port.


To negate in-place, use neg to subtract from 0

neg   rdx              ; high half first
neg   rax             ; subtract RDX:RAX from 0
sbb   rdx, 0          ; with carry from low to high half
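
In C terms, what this sequence computes on the two halves is roughly the following (a sketch; hi:lo are illustrative stand-ins for RDX:RAX):

    #include <stdint.h>

    /* Sketch of NEG RDX / NEG RAX / SBB RDX,0 on a 128-bit value hi:lo. */
    static void neg128_in_place(uint64_t *hi, uint64_t *lo)
    {
        uint64_t borrow = (*lo != 0);   /* the CF left by NEG RAX  */
        *lo = 0 - *lo;                  /* NEG RAX                 */
        *hi = 0 - *hi - borrow;         /* NEG RDX, then SBB RDX,0 */
    }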

neg is exactly equivalent to sub from 0, as far as setting flags and performance.

ADC/SBB with an immediate 0 is only 1 uop on Intel SnB/IvB/Haswell, as a special case. It's still 2 uops on Nehalem and earlier, though. But without mov-elimination, mov to another register and then sbb back into RDX would be slower.

The low half (in RAX) is ready in the first cycle after it's ready as an input to neg. (So out-of-order execution of later code can get started using the low half.)

The high half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait for rdx from neg rdx and CF from neg rax. So it's ready at the later of 1 cycle after the low half, or 2 cycles after the input high half is ready.


The above sequence is better than any in the question, being fewer uops on very common Intel CPUs. On Broadwell and later (where SBB is a single uop in general, not just the immediate-0 special case), this alternative is just as good:

;; equally good on Broadwell/Skylake, and AMD.  But worse on Intel SnB through HSW
NOT RDX
NEG RAX
SBB RDX,-1     ; can't use the imm=0 special case

All of the 4-instruction sequences are obviously sub-optimal, being more total uops. And some of them have worse ILP / dependency chains / latency, e.g. 2 instructions on the critical path for the low half, or a 3-cycle chain for the high half.

answered Sep 18 '22 by Peter Cordes