 

How to implement the totalOrder predicate for floating point values?

The IEEE 754 specification defines a total order in §5.10, which I want to implement in assembly.

From the Wikipedia description, it sounds like this can be implemented branch-free, or almost branch-free, but I haven't been able to come up with a decent approach, and I couldn't find any existing spec-compliant implementation in a major programming language.

When comparing two floating-point numbers, it acts as the ≤ operation, except that totalOrder(−0, +0) ∧ ¬ totalOrder(+0, −0), and different representations of the same floating-point number are ordered by their exponent multiplied by the sign bit. The ordering is then extended to the NaNs by ordering −qNaN < −sNaN < numbers < +sNaN < +qNaN, with ordering between two NaNs in the same class being based on the integer payload, multiplied by the sign bit, of those data.

Does it make sense to first check for NaNs and then either jump to a floating point comparison or handle the NaN case, or does it make more sense to move the floating point value to integer registers and do all the operations there?

(At least from reading the description, it feels like the spec authors made an effort to allow a straightforward implementation with integer instructions.)

What's the "best" way to implement a total order for floating points on x86-64 processors?

asked Mar 03 '23 by soc


1 Answer

This all Just Works if you compare the FP bit-patterns as sign/magnitude integers, including -0 < +0 and the NaN bit-patterns¹. This is one reason why IEEE formats like binary64 (double) use a biased exponent and put the fields in that order. (Another being convenient implementation of nextafter by ++ or -- on the bit pattern.)

That can be implemented efficiently in terms of 2's complement integer compare:

  • if the signs are both cleared: non-negative numbers Just Work
  • if only one has its sign bit set: any negative is less than any non-negative, including -0.0 < +0.0 as 0x80000000 < 0x00000000 so 2's complement x <= y Just Works.
  • if both have their sign bit set ((x&y)>>63): 2's complement x<y is sign/magnitude FP x>y. In x86 asm, you may avoid the shift and just look at SF, or use the high bit of a SIMD element.

    Handling this without messing up the == case is tricky: you can't just XOR x&y sign into a < result; that would flip it when they compared equal. It would give you <= when both inputs are negative but < for other cases. I'm not sure if that's usable for sorting.
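A different branchless route that sidesteps the XOR-fixup problem entirely is to map each sign/magnitude bit pattern to a two's-complement key that preserves the total order, then compare the keys with a plain `<=`. Here's a C sketch of that mapping (mine, not part of the original answer; the function names are made up):

```c
#include <stdint.h>
#include <string.h>

// Map a binary64 bit pattern (sign/magnitude) to a signed integer key whose
// two's-complement order matches IEEE-754 totalOrder: if the sign bit is set,
// flip the low 63 bits so larger negative magnitudes become more-negative
// integers; non-negative values pass through unchanged.
int64_t total_order_key(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                     // type-pun without UB
    uint64_t flip = ((uint64_t)0 - (bits >> 63)) >> 1;  // 0 or 0x7FFF...FFFF
    return (int64_t)(bits ^ flip);
}

// totalOrder(x, y): non-zero if x precedes or equals y in the total order.
int total_order(double x, double y) {
    return total_order_key(x) <= total_order_key(y);
}
```

Since the mapping is applied to both operands independently, equal bit patterns map to equal keys, so `<=` on the keys gives a consistent totalOrder without any special-casing of the equal case.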


With SSE4.2 pcmpgtq you can operate on double FP values in their normal XMM registers, or SSE2 (guaranteed for x86-64) pcmpgtd for 32-bit float. (Note that pcmpgtq is relatively slow compared to pcmpgtd: fewer ports and higher latency. https://agner.org/optimize/. e.g. on Skylake, 1 p5 uop with 3c latency, vs. pcmpgtd and pcmpeqq being 1 uop for p0/p1 with 1 cycle latency.)

We can't handle the bitwise-equal case using just one pcmpgtq + sign fixups.
x1 bitwise_eq x0 gives a pcmpgtq result of 0 whether the inputs are positive or negative. Flipping it based on sign(x0&x1) would give inconsistent behaviour, whichever of 0 or 1 you want to mean >, >=, < or <= according to the total order. But unfortunately the -0.0 == +0.0 behaviour of FP comparisons means we have to special-case on FP-equal, not just FP-unordered.

You don't need assembly: type-pun to uint64_t in C (e.g. via memcpy) to get the compiler to use movq rax, xmm0, or use intrinsics for vector regs.
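For example, a scalar C version of the (less-than XOR both-negative) OR bitwise-equal idea looks like this (my sketch, not from the answer; the function name is made up):

```c
#include <stdint.h>
#include <string.h>

// Branchless totalOrder(x, y) on binary64: compare the bit patterns as
// two's-complement integers, flip the "less" result when both sign bits are
// set (sign/magnitude order is reversed for negatives), and OR in bitwise
// equality so equal patterns always compare as ordered-before-or-equal.
int totalOrderBits(double x, double y) {
    uint64_t ux, uy;
    memcpy(&ux, &x, sizeof ux);   // movq-style type pun, no strict-aliasing UB
    memcpy(&uy, &y, sizeof uy);
    int lt       = (int64_t)ux < (int64_t)uy;   // 2's complement x < y
    int eq       = (ux == uy);                  // bitwise equal
    int both_neg = (int)((ux & uy) >> 63);      // 1 iff both sign bits set
    return (lt ^ both_neg) | eq;
}
```

Compilers typically lower the memcpy calls to plain movq reg, xmm moves, so this stays close to the asm discussed below.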

But if you are using asm, you could consider doing an FP compare and branching on ZF=1 which will be set for unordered or equal, and only then doing integer. If you expect NaN and exact equality (including +-0.0 == -+0.0) to be rare, this could work well. Notice that ZF,CF,PF = 1,1,1 for unordered in the ucomisd docs. All x86 FP compares set flags the same way, either directly or via fcom/fnstsw ax/lahf.

For example a stand-alone version could look like this. (Simplify when inlining, e.g. branch directly with jb instead of setb if the caller branches):

totalOrder:   ; 0/1 integer in EAX = (xmm0 <= xmm1 totalOrder)
    xor      eax, eax
    ucomisd  xmm0, xmm1           ; ZF=0 implies PF=0 (ordered) so just check ZF
    jz    .compare_as_integer     ; unordered or FP-equal
     ; else CF accurately reflects the < or > (total) order of xmm0 vs. xmm1
    setb     al                 ; or branch with jb
    ret

;; SSE4.2, using AVX 3-operand versions.  Use movaps as needed for non-AVX
### Untested
        ; Used for unordered or FP-equal, including -0.0 == +0.0
        ; but also including -1.0 == -1.0 for example
 .compare_as_integer:          ; should work in general for any sign/magnitude integer
    vpcmpgtq xmm2, xmm1, xmm0     ; reversed order of comparison: x1>x0 == x0<x1
    vpand    xmm3, xmm1, xmm0     ; we only care about the MSB of the 64-bit integer
    vpxor    xmm2, xmm2, xmm3     ; flip if x0 & x1 are negative

    vpcmpeqq xmm1, xmm1, xmm0
    vpor     xmm2, xmm2, xmm1
       ; top bits of XMM2 hold the boolean result for each SIMD element
       ; suitable for use with blendvpd

    vmovmskpd eax, xmm2           ; low bit of EAX = valid, high bit might be garbage
    and      eax, 1          ; optional depending on use-case
     ; EAX=1 if x0 bitwise_eq x1 or sign/magnitude x1 > x0
    ret
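The same dataflow can be written with intrinsics instead of raw asm. This is my sketch (the function name is made up; a GCC/Clang target attribute is used so it compiles without -msse4.2, and it assumes the CPU actually supports SSE4.2 at run time):

```c
#include <immintrin.h>
#include <stdint.h>

// Same dataflow as the asm listing above:
// (y > x as signed int64) XOR sign(x & y), OR bitwise-equal,
// then grab the sign bit of element 0 with movmskpd.
__attribute__((target("sse4.2")))
int totalOrderSSE(double x, double y) {
    __m128i ix = _mm_castpd_si128(_mm_set_sd(x));  // bit pattern in element 0
    __m128i iy = _mm_castpd_si128(_mm_set_sd(y));
    __m128i gt = _mm_cmpgt_epi64(iy, ix);          // pcmpgtq: x < y, 2's complement
    __m128i nb = _mm_and_si128(ix, iy);            // MSB set iff both negative
    __m128i r  = _mm_xor_si128(gt, nb);            // flip when both inputs negative
    __m128i eq = _mm_cmpeq_epi64(ix, iy);          // pcmpeqq: bitwise equal
    r = _mm_or_si128(r, eq);
    return _mm_movemask_pd(_mm_castsi128_pd(r)) & 1;  // sign bit of element 0
}
```

Only the MSB of each 64-bit element is meaningful after the XOR, which is exactly what movmskpd extracts; the upper element is garbage and gets masked off.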

With AVX512VL, vpternlogq can replace all 3 of the AND/XOR/OR operations; it can implement any arbitrary boolean function of 3 inputs. (y_gt_x) ^ (x&y) | y_eq_x.

Without SSE4.2, or just as a scalar branchless strategy, I came up with this. (e.g. if values were actually in memory so you could just do mov loads instead of movq from XMM regs).

;; works on its own, or as the fallback after ucomisd/jz
compare_as_integer:
    movq     rcx, xmm0
    movq     rsi, xmm1

    xor      eax, eax
    cmp      rcx, rsi
   ; je  bitwise equal special case would simplify the rest
    setl     al                 ; 2's complement x < y
    sete     dl
    and      rcx, rsi           ; maybe something with TEST / CMOVS?
    shr      rcx, 63
    xor      al, cl           ; flip the SETL result if both inputs were negative
    or       al, dl           ; always true on bitwise equal
    ret

The xor-zeroing of EAX should make it safe to read EAX without a partial-reg stall even on P6-family, after writing AL with setl and 8-bit xor and or. (Why doesn't GCC use partial registers?). On most other CPUs, the only downside here is a false dependency on the old value of RDX which I didn't break before sete dl. If I had xor-zeroed EDX first, we could xor and or into EAX.
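For comparison, the ucomisd-then-integer strategy maps to portable C roughly like this (my sketch; the function name is made up). An ordinary FP compare is false for unordered operands, so the two fast-path branches handle every ordered, non-equal pair, and everything else falls through to the integer formula:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

// Fast path: x < y and x > y are both false when the operands are unordered
// (NaN) or FP-equal (including -0.0 == +0.0). Those cases fall back to the
// sign/magnitude integer compare; ordered, non-equal pairs return immediately.
int totalOrderHybrid(double x, double y) {
    if (x < y) return 1;
    if (x > y) return 0;
    uint64_t ux, uy;
    memcpy(&ux, &x, sizeof ux);
    memcpy(&uy, &y, sizeof uy);
    int lt       = (int64_t)ux < (int64_t)uy;   // 2's complement compare
    int both_neg = (int)((ux & uy) >> 63);      // flip when both negative
    return (lt ^ both_neg) | (ux == uy);
}
```

A compiler will turn the two FP comparisons into ucomisd/comisd plus branches, much like the jz fast path in the asm above.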

A branchy strategy could work like this:

;; probably slower unless data is predictable, e.g. mostly non-negative
compare_as_integer_branchy:
    movq     rcx, xmm0
    movq     rsi, xmm1

    xor      eax, eax       ; mov eax,1 with je to a ret wouldn't avoid partial-register stalls for setl al
    cmp      rcx, rsi
    je      .flip_result        ; return 1
    setl     al                 ; 2's complement x < y

    test     rcx, rsi
    js     .flip_result         ; if (x&y both negative)
    ret

.flip_result:         ; bitwise EQ (AL is still 0, so this returns 1), or both inputs negative
    xor      al, 1
    ret

Mix and match parts of this if you want; the AND/SHR/XOR could be used along the non-equal path instead of test+js.


If inlining this where you branch on the result, you could put the hopefully-common case (ordered and not equal) ahead of the special-case handling. But then the fall-through path still includes ordered x1 < x0, not just unordered/equal, so a hopefully-predictable branch on ZF=1 (which includes the PF=1 unordered case) might still be a good idea.

    ucomisd  xmm1, xmm0
    ja       x1_gt_x0                ; CF==0 && ZF==0
    ; maybe unordered, maybe -0 vs +0, maybe just x1 < x0

Footnote 1: NaN encodings as part of the total order

FP values (and their sign/magnitude encodings) are symmetric around zero. The sign bit is always a sign bit, even for NaNs, and can thus be handled the same way.

  • The smallest magnitudes are +-0.0 of course: all exponent and mantissa bits zero.
  • The subnormals have a zero exponent field (the minimum value), implying a leading zero for the mantissa. The explicit part is non-zero. Magnitude is linear with the mantissa. (Zero is actually just a special case of a subnormal.)
  • The normalized numbers span exponent = 1 to exponent < max, implying a leading 1 in the mantissa. The highest value within one exponent (all mantissa bits set) is just below the exponent+1, mantissa=0 value: i.e. incrementing the bit pattern by 1, with carry from mantissa into exponent, produces the next representable float/double.
  • +- Infinity has exponent = all-ones, mantissa = 0
  • +- NaN has exponent = all-ones, mantissa = non-zero
    • on x86, sNaN has the highest bit of the mantissa cleared. Rest is payload with at least 1 set bit anywhere (else it's an Inf).
    • on x86, qNaN has the highest bit of the mantissa set. Rest is payload
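To sanity-check these encodings on a given platform, you can pull the fields out of the bit pattern. This is a sketch of mine (not from the answer); the quiet-bit convention it tests is the x86 one described above:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

// Decompose a binary64 bit pattern into the fields discussed above and check
// for a quiet NaN under the x86 convention: exponent all-ones, non-zero
// mantissa with its top (quiet) bit set. The default NaN produced by hardware
// (and the NAN macro) on x86 is a qNaN like this, e.g. 0x7FF8000000000000.
int is_qnan_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    uint64_t exp  = (bits >> 52) & 0x7FF;        // 11-bit exponent field
    uint64_t mant = bits & 0xFFFFFFFFFFFFFull;   // 52-bit mantissa field
    return exp == 0x7FF && (mant >> 51);         // NaN with quiet bit set
}
```

With the quiet bit as the mantissa MSB, any x86 qNaN has a larger integer magnitude than any sNaN with the same sign, which is what makes the plain integer compare order them correctly.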

https://cwiki.apache.org/confluence/display/stdcxx/FloatingPoint (linked from Are the bit patterns of NaNs really hardware-dependent?) shows some sNaN and qNaN encodings on a couple of other ISAs. Some differ from x86, but POWER and Alpha have the MSB of the mantissa set for qNaN, so their qNaNs have larger integer magnitude than any sNaN.

PA-RISC chose the other way around, so implementing a total-order predicate on that (obsolete) ISA would need extra work in the FP-compare unordered case; flipping that bit in both values, if either is any kind of NaN, might work before proceeding with the integer compare.

(I mention this because the same algorithm could be used in higher level languages that might not be exclusively used on x86. But you might want to just leave it and always handle the same binary bit-patterns the same way, even if that means qNaN < sNaN on some platforms. You only even get sNaN in the first place by manually writing the bit-pattern.)


PS: I know "significand" is more technically correct, but "mantissa" has fewer syllables and I like it better, and is well understood in this context.

answered Apr 28 '23 by Peter Cordes