I want to do some long integer math (128-bit) in Intel x64 assembly and need to take a 2's complement. Let's say that my positive value is in RDX:RAX.
2's complement is done by "flip the bits and add 1". So the most naive implementation is (4 instructions and 14 bytes of code):
NOT RAX
NOT RDX
ADD RAX,1 ; Can't use INC, it doesn't set Carry
ADC RDX,0
When I use the NEG instruction on RAX instead of NOT, it does the "+1" for me, but the Carry is wrong: NEG RAX clears Carry when RAX was zero, yet that is exactly the case where I need Carry set. So the next best approach might be (4 instructions and 11 bytes of code):
NOT RDX
NEG RAX
CMC
ADC RDX,0 ; fixed, thanks lurker
Still 4 instructions. But instead of adding +1, I can subtract -1, and since SBB adds the Carry bit to the subtrahend, it will add +1 only when Carry is clear. So my next best try is this, with 3 instructions and 10 bytes of code:
NOT RDX
NEG RAX
SBB RDX,-1
As you can see from my long-winded text, this is not obvious to understand. Is there a better, more understandable way to do a cascaded 2's complement in assembler?
To convert a negative signed integer to 2's complement:
1. Fix the number of bits. Write zeros in the extra places to the left.
2. Replace each bit by 1 - bit. Notice this amounts to flipping the 0's and 1's.
3. Add 1, ignoring any carry out of the leftmost bit.
The 2's complement representation of a positive number is the same as its ordinary binary representation, with leading zeros affixed.
In the 8-bit two's complement representation of an integer value between -127 and 127, positive integers are represented in the normal binary manner (with a natural leading zero, since the largest value is 127), and the representations of negative numbers are obtained from those of the corresponding positive numbers as above.
For example, find the 8-bit 2's complement representation of -45. First convert 45 to binary: 45 (decimal) = 101101 (binary). Pad with zeros on the left to make eight bits: 00101101. Flip the 0's and 1's: 11010010. Add 1: 11010010 + 1 = 11010011.
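For illustration, here is how those three steps look in assembly, as a minimal sketch using the 8-bit register AL (the register choice is mine, not from the text above):
mov al, 45     ; 00101101 = +45, already eight bits
not al         ; 11010010  (replace each bit by 1 - bit)
add al, 1      ; 11010011  = -45 in 8-bit 2's complement
(NEG AL would perform the last two steps in one instruction.)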
Shorter instructions or fewer instructions don't necessarily mean faster execution, because the latency and throughput of each instruction are different.
For example, obsolete instructions like enter, daa, or loop will be very slow; they're there only for backward compatibility. Even inc is sometimes slower than add. The same goes for the cmc you used above on some μarchs.
As a result, a longer series of low-latency instructions that can execute in parallel will often run much faster. Some common groups of instructions can even be fused together into a single macro-op. The compilers' optimizers know this and will choose the most appropriate instructions to emit.
For this snippet
__int128 negate(__int128 x)
{
return -x;
}
ICC 19.0.1 generates the following instructions:
xor edx, edx #3.13
xor eax, eax #3.13
sub rax, rdi #3.13
sbb rdx, rsi #3.13
The first two xor instructions cost zero μops, because they're handled at the register-rename stage. Now you have only 2 instructions to execute.
You can switch the compiler in the Godbolt link above to see the various ways different compilers negate, including MSVC (unfortunately it doesn't have a 128-bit integer type yet). Below are the results for GCC and Clang.
GCC 8.3:
mov rax, rdi
neg rax
mov rdx, rsi
adc rdx, 0
neg rdx
Clang:
mov rax, rdi
xor edx, edx
neg rax
sbb rdx, rsi
As you can see, Clang also uses only 3 instructions (not counting the first one, which moves data from the input argument to the required destination). But like xor reg, reg, a mov can also be "free".
Things may be different if you optimize for space (such as in cases where cache misses are high), because some immediates and instructions are long.
Whether it's faster or not needs some micro-benchmarking. But on Intel CPUs, the Intel compiler (ICC) often achieves higher performance than others because it understands the architecture better.
Note that the operation is called negation, not two's complement; two's complement is a way to encode negative numbers.
BTW, negating a 2-register number is the same in 32-bit or 16-bit mode, with EDX:EAX or DX:AX: use the same instruction sequences, as sketched below.
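For instance, the question's 3-instruction sequence carries over to 32-bit mode unchanged apart from the register names (a direct transcription, as a sketch):
not edx        ; one's complement of the high half
neg eax        ; negate the low half, CF = (EAX != 0)
sbb edx, -1    ; add 1 to the high half only when EAX was zero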
To copy-and-negate, @phuclv's answer shows efficient compiler output. The best bet is xor-zeroing the destination and then using sub / sbb.
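That is, the sequence ICC emitted above, repeated here with explanatory comments:
xor edx, edx   ; zero the destination high half (dependency-breaking idiom)
xor eax, eax   ; zero the destination low half
sub rax, rdi   ; low half: 0 - RDI, sets CF when RDI != 0
sbb rdx, rsi   ; high half: 0 - RSI - CF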
That's 4 uops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 uops. The xor-zeroing is off the critical path (it can happen before the data to be negated is ready), so the total latency is 2 or 3 cycles for the high half. The low half is of course ready with 1-cycle latency.
Clang's mov/neg for the low half is maybe better on Ryzen, which has mov-elimination for GP integer registers but still needs an ALU execution unit for xor-zeroing. On older CPUs, though, it puts a mov on the critical path for latency. Still, back-end ALU pressure is usually not as big a deal as front-end bottlenecks, for instructions that can use any ALU port.
To negate in-place, use neg to subtract from 0:
neg rdx ; high half first
neg rax ; subtract RDX:RAX from 0
sbb rdx, 0 ; with carry from low to high half
neg is exactly equivalent to sub from 0, as far as setting flags and performance.
ADC/SBB with an immediate 0 is only 1 uop on Intel SnB/IvB/Haswell, as a special case. It's still 2 uops on Nehalem and earlier, though. But without mov-elimination, a mov to another register and then sbb back into RDX would be slower.
The low half (in RAX) is ready in the first cycle after it's ready as an input to neg. (So out-of-order execution of later code can get started using the low half.)
The high half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait for RDX from neg rdx and for CF from neg rax, so it's ready at the later of 1 cycle after the low half, or 2 cycles after the input high half is ready.
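Illustratively, assuming all inputs are ready at cycle 0 and 1-cycle neg/sbb latencies:
neg rdx      ; starts at cycle 0, independent of the low half
neg rax      ; starts at cycle 0; RAX ready at cycle 1, CF = (RAX != 0)
sbb rdx, 0   ; waits on RDX and CF; runs at cycle 1, high half ready at cycle 2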
The above sequence is better than any in the question, being fewer uops on very common Intel CPUs. On Broadwell and later (where SBB is single-uop, not just with an immediate 0), the 3-instruction sequence from the question is equally good:
;; equally good on Broadwell/Skylake, and AMD. But worse on Intel SnB through HSW
NOT RDX
NEG RAX
SBB RDX,-1 ; can't use the imm=0 special case
All of the 4-instruction sequences are obviously sub-optimal, being more total uops. And some of them have worse ILP / dependency chains / latency, like 2 instructions on the critical path for the low half, or a 3-cycle chain for the high half.