The IEEE 754 specification defines a total order in §5.10, which I want to implement in assembly.
From the Wikipedia description, it sounds a lot like this can be implemented branch-free, or almost branch-free, but I haven't been able to come up with a decent approach, and I couldn't find any existing spec-compliant implementation in major programming languages.
When comparing two floating-point numbers, it acts as the ≤ operation, except that totalOrder(−0, +0) ∧ ¬totalOrder(+0, −0), and different representations of the same floating-point number are ordered by their exponent multiplied by the sign bit. The ordering is then extended to the NaNs by ordering −qNaN < −sNaN < numbers < +sNaN < +qNaN, with the ordering between two NaNs in the same class based on their integer payload, multiplied by the sign bit.
Does it make sense to first check for NaNs and then either jump to a floating point comparison or handle the NaN case, or does it make more sense to move the floating point value to integer registers and do all the operations there?
(At least from reading the description, it feels like the spec authors made an effort to allow a straightforward implementation with integer instructions.)
What's the "best" way to implement a total order for floating points on x86-64 processors?
This all Just Works if you compare the FP bit-patterns as sign/magnitude integers, including -0 < +0 and the NaN bit-patterns¹. This is one reason why IEEE formats like binary64 (double) use a biased exponent and put the fields in that order. (Another being convenient implementation of nextafter by ++ or -- on the bit pattern.)
That can be implemented efficiently in terms of 2's complement integer compare: -0.0 < +0.0 becomes 0x80000000 < 0x00000000, so a 2's complement x <= y Just Works. If both inputs have their sign bit set ((x&y)>>63 is 1), the order reverses: 2's complement x<y means sign/magnitude FP x>y. In x86 asm, you can avoid the shift and just look at SF, or use the high bit of a SIMD element.
Handling this without messing up the == case is tricky: you can't just XOR the x&y sign into a < result; that would flip it when they compared equal. It would give you <= when both inputs are negative but < for other cases. I'm not sure if that's usable for sorting.
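As a concrete sketch of the scalar idea in C (my own untested translation of the trick above; `total_order_le` is a hypothetical name, and the `memcpy` type-pun assumes 64-bit `double`):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch: 2's complement compare of the bit patterns, flipped when both
 * sign bits are set, with bitwise-equal handled as a separate case.
 * Returns totalOrder(x, y), i.e. x <= y in the IEEE total order. */
static bool total_order_le(double x, double y) {
    int64_t ix, iy;
    memcpy(&ix, &x, sizeof ix);   /* type-pun without UB; should compile to movq */
    memcpy(&iy, &y, sizeof iy);
    if (ix == iy)
        return true;              /* bitwise equal, including identical NaNs */
    bool lt = ix < iy;            /* 2's complement compare */
    if ((ix & iy) < 0)            /* both sign bits set: order is reversed */
        lt = !lt;
    return lt;
}
```

Note the explicit bitwise-equal check up front; as discussed above, XOR-ing the sign into the result without it would break the equal case.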
With SSE4.2 pcmpgtq you can operate on double FP values in their normal XMM registers, or with SSE2 (guaranteed for x86-64) pcmpgtd for 32-bit float. (Note that pcmpgtq is relatively slow compared to pcmpgtd: fewer ports and higher latency; see https://agner.org/optimize/. E.g. on Skylake, pcmpgtq is 1 p5 uop with 3c latency, vs. pcmpgtd and pcmpeqq being 1 uop for p0/p1 with 1 cycle latency.)
We can't handle the bitwise-equal case using just one pcmpgtq + sign fixups: x1 bitwise_eq x0 gives a pcmpgtq result of 0 whether the inputs are positive or negative. Flipping that based on sign(x0&x1) would give inconsistent behaviour no matter which of 0 or 1 you want to mean >, >=, < or <= according to the total order. And unfortunately the -0.0 == +0.0 behaviour of FP comparisons means we have to special-case on FP-equal, not just FP-unordered.
You don't need assembly: just type-pun to uint64_t (in C, for example) to get the compiler to probably use movq rax, xmm0, or use intrinsics for vector regs.

But if you are using asm, you could consider doing an FP compare and branching on ZF=1, which will be set for unordered or equal, and only then doing the integer compare. If you expect NaN and exact equality (including +-0.0 == -+0.0) to be rare, this could work well. Notice that ZF,CF,PF = 1,1,1 for unordered in the ucomisd docs. All x86 FP compares set flags the same way, either directly or via fcom / fnstsw ax / lahf.
For example, a stand-alone version could look like this. (Simplify when inlining, e.g. branch directly with jb instead of setb if the caller branches):
totalOrder: ; 0/1 integer in EAX = (xmm0 <= xmm1 totalOrder)
xor eax, eax
ucomisd xmm0, xmm1 ; ZF=0 implies PF=0 (ordered) so just check ZF
jz .compare_as_integer ; unordered or FP-equal
; else CF accurately reflects the < or > (total) order of xmm0 vs. xmm1
setb al ; or branch with jb
ret
;; SSE4.2, using AVX 3-operand versions. Use movaps as needed for non-AVX
### Untested
; Used for unordered or FP-equal, including -0.0 == +0.0
; but also including -1.0 == -1.0 for example
.compare_as_integer: ; should work in general for any sign/magnitude integer
vpcmpgtq xmm2, xmm1, xmm0 ; reversed order of comparison: x1>x0 == x0<x1
vpand xmm3, xmm1, xmm0 ; we only care about the MSB of the 64-bit integer
vpxor xmm2, xmm2, xmm3 ; flip if x0 & x1 are negative
vpcmpeqq xmm1, xmm1, xmm0
vpor xmm2, xmm2, xmm1
; top bits of XMM2 hold the boolean result for each SIMD element
; suitable for use with blendvpd
vmovmskpd eax, xmm2 ; low bit of EAX = the result for the low element; high bit might be garbage
and eax, 1 ; optional depending on use-case
; EAX=1 if x0 bitwise_eq x1 or sign/magnitude x1 > x0
ret
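To sanity-check that sequence, here is what one 64-bit lane of it computes, written in plain scalar C (a hypothetical helper of my own for illustration; only the MSB of each intermediate matters, which is what vmovmskpd reads):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One 64-bit lane of the vpcmpgtq/vpand/vpxor/vpcmpeqq/vpor sequence,
 * emulated in scalar C.  Returns 1 if x0 bitwise_eq x1 or sign/magnitude
 * x1 > x0, matching the comment on the asm version. */
static bool simd_lane_total_order(double a, double b) {
    uint64_t x0, x1;
    memcpy(&x0, &a, sizeof x0);
    memcpy(&x1, &b, sizeof x1);
    uint64_t gt  = ((int64_t)x1 > (int64_t)x0) ? ~0ull : 0;  /* vpcmpgtq xmm1, xmm0 */
    uint64_t sgn = x1 & x0;                                  /* vpand: MSB set iff both negative */
    uint64_t r   = gt ^ sgn;                                 /* vpxor: flip if both negative */
    uint64_t eq  = (x1 == x0) ? ~0ull : 0;                   /* vpcmpeqq */
    r |= eq;                                                 /* vpor */
    return r >> 63;                                          /* vmovmskpd reads the MSB */
}
```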
With AVX512VL, vpternlogq can replace all 3 of the AND/XOR/OR operations; it can implement any arbitrary boolean function of 3 inputs, in this case (y_gt_x ^ (x&y)) | y_eq_x.
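The imm8 selector for that can be derived mechanically. This little C sketch assumes the usual convention where the first source operand selects bit 2 of the imm8 index, the second bit 1, and the third bit 0 (check the SDM entry for vpternlogq before trusting the operand mapping):

```c
#include <stdint.h>

/* Compute the vpternlogq imm8 for f(a,b,c) = (a ^ b) | c, with
 * a = y_gt_x, b = x & y, c = y_eq_x.  Bit (a<<2 | b<<1 | c) of the imm8
 * holds the function's value for that input combination. */
static uint8_t ternlog_imm8(void) {
    uint8_t imm = 0;
    for (int i = 0; i < 8; i++) {
        int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        if ((a ^ b) | c)
            imm |= (uint8_t)(1u << i);
    }
    return imm;
}
```

For (a ^ b) | c this works out to 0xBE.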
Without SSE4.2, or just as a scalar branchless strategy, I came up with this. (E.g. if the values were actually in memory you could just do mov loads instead of movq from XMM regs.)
;; works on its own, or as the fallback after ucomisd/jz
compare_as_integer:
movq rcx, xmm0
movq rsi, xmm1
xor eax, eax
cmp rcx, rsi
; je bitwise equal special case would simplify the rest
setl al ; 2's complement x < y
sete dl
and rcx, rsi ; maybe something with TEST / CMOVS?
shr rcx, 63
xor al, cl ; flip the SETL result if both inputs were negative
or al, dl ; always true on bitwise equal
ret
The xor-zeroing of EAX should make it safe to read EAX without a partial-register stall even on P6-family, after writing AL with setl and the 8-bit xor and or. (Why doesn't GCC use partial registers?) On most other CPUs, the only downside here is a false dependency on the old value of RDX, which I didn't break before sete dl. If I had xor-zeroed EDX first, we could xor and or into full EAX.
A branchy strategy could work like this:
;; probably slower unless data is predictable, e.g. mostly non-negative
compare_as_integer_branchy:
movq rcx, xmm0
movq rsi, xmm1
xor eax, eax ; mov eax,1 with je to a ret wouldn't avoid partial-register stalls for setl al
cmp rcx, rsi
je .flip_result ; return 1
setl al ; 2's complement x < y
test rcx, rsi
js .flip_result ; if (x&y both negative)
ret
.flip_result: ; bitwise EQ (AL=0, so return 1), or both inputs negative (flip the SETL result)
xor al, 1
ret
Mix and match parts of this if you want; the AND/SHR/XOR sequence could be used along the non-equal path instead of test+js.
If inlining this in a case where you branch on the result, you could put the common(?)-case branch (finite and not equal) ahead of the special-case handling. But then the special case includes ordered <, so a hopefully-predictable branch on ZF=1 (which includes the PF=1 unordered case) might still be a good idea.
ucomisd xmm1, xmm0
ja x1_gt_x0 ; CF==0 && ZF==0
; maybe unordered, maybe -0 vs +0, maybe just x1 < x0
Footnote 1: NaN encodings as part of the total order
FP values (and their sign/magnitude encodings) are symmetric around zero. The sign bit is always a sign bit, even for NaNs, and can thus be handled the same way.
https://cwiki.apache.org/confluence/display/stdcxx/FloatingPoint (linked from Are the bit patterns of NaNs really hardware-dependent?) shows some sNaN and qNaN encodings on a couple of other ISAs. Some differ from x86, but POWER and Alpha have the MSB of the mantissa set for qNaN, so qNaNs have larger integer magnitude than any sNaN.
PA-RISC chose the other way around, so implementing a total-order predicate on that (obsolete) ISA would need extra work for the FP-compare unordered case; flipping that bit in both values if either is any kind of NaN might work, before proceeding with the integer compare.
(I mention this because the same algorithm could be used in higher level languages that might not be exclusively used on x86. But you might want to just leave it and always handle the same binary bit-patterns the same way, even if that means qNaN < sNaN on some platforms. You only even get sNaN in the first place by manually writing the bit-pattern.)
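As a quick check of the x86/POWER/Alpha claim: with the quiet bit as the top mantissa bit, any positive qNaN's bit pattern compares above any positive sNaN's as an integer. These binary64 encodings are written out by hand for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* binary64: exponent field all-ones plus a non-zero mantissa is a NaN.
 * On x86 (and POWER, Alpha) the mantissa MSB marks a quiet NaN. */
static const uint64_t QNAN_MIN = 0x7FF8000000000000ull; /* smallest +qNaN encoding */
static const uint64_t SNAN_MAX = 0x7FF7FFFFFFFFFFFFull; /* largest  +sNaN encoding */

static bool qnan_above_snan(void) {
    /* integer magnitude: qNaN > sNaN, matching the total order's
     * numbers < +sNaN < +qNaN requirement */
    return QNAN_MIN > SNAN_MAX;
}
```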
PS: I know "significand" is more technically correct, but "mantissa" has fewer syllables, I like it better, and it's well understood in this context.