Unsigned 64-bit to double conversion: why this algorithm from g++

Question

Using g++ 4.9.2 if I compile

bool int_dbl_com(const unsigned long long x, const double y)
{
    return x <= y;
}

then the assembler output (for the Windows x64 calling convention) is:

testq     %rcx, %rcx            # x in RCX
js        .L2
pxor      %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd   %xmm0, %xmm1          # y in XMM1
setae     %al
ret

The command cvtsi2sdq is signed conversion, and the first test and jump combination is to check if %rcx < 0. If so, we go to L2, and this I don't understand:

.L2:
movq       %rcx, %rax
andl       $1, %ecx
pxor       %xmm0, %xmm0
shrq       %rax
orq        %rcx, %rax
cvtsi2sdq  %rax, %xmm0
addsd      %xmm0, %xmm0
ucomisd    %xmm0, %xmm1
setae      %al
ret

Naively, you might halve %rcx, convert to a double in %xmm0, and then add %xmm0 to itself to get back the original value (accepting, of course, that you've lost some low order precision going from a 64-bit integer to a 64-bit float).

But that's not what the code does: it seems to save the lowest-order bit of %rcx and then ors this back to the result. Why?? And why bother when these low order bits are going to be lost anyway (or am I mistaken here)?

(The same algorithm seems to be used regardless of the optimisation; I used -O3 here to make it easier to see.)

Pascal Cuoq · Accepted Answer

.L2:
movq       %rcx, %rax
andl       $1, %ecx       ; save the least significant bit of %rax
pxor       %xmm0, %xmm0
shrq       %rax           ; make %rax represent half the original number, as a signed value
orq        %rcx, %rax     ; “round to odd”: if the division by two above was not exact, ensure the result is odd
cvtsi2sdq  %rax, %xmm0    ; convert to floating-point
addsd      %xmm0, %xmm0   ; multiply by two
ucomisd    %xmm0, %xmm1   ; compare …
setae      %al
ret

The last three instructions implement <= and return from the source code. The other ones are all part of the conversion from uint64_t to double.

The difficult-to-understand step is the one I commented as “round to odd”. “Rounding to odd” is a technique that prevents the nasty effects of “double rounding”.

In effect, the algorithm is to convert from 64-bit to 63-bit, and then from 63 bits to the 53-bit significand of IEEE 754 binary64. If implemented naively, in some cases, these two conversions can produce a result that differs from a direct single conversion from 64-bit integer to floating-point with 53-bit significand. This phenomenon is what is called “double rounding”.

Rounding to odd ensures that the result of the intermediate rounding is not to a value that would be rounded in the wrong direction in a case of double-rounding. This is enough to make the sequences below equivalent for all inputs:

64-bit ---(round to odd)---> 63-bit ---(round to nearest even)----> binary64 
64-bit -(round-to-nearest-even,the conversion the compiler wants)-> binary64

To answer other aspects of your question:

But that's not what the code does: it seems to save the lowest-order bit of %rcx and then ors this back to the result. Why?? And why bother when these low order bits are going to be lost anyway (or am I mistaken here)?

This is exactly how to implement round-to-odd in this particular instance. The least significant bit of %rcx is one iff the shift is not an exact division by two, and in this case, the result must be made odd.

The same algorithm seems to be used regardless of the optimisation; I used -O3 here to make it easier to see.

The instruction sequence is optimal (as far as I can see, for modern processors) and corresponds to the source-level conversion from uint64_t int to double. It requires no effort from the compiler to use it even at the lowest optimization level. What could happen with optimization (but doesn't happen here) is that the instructions are melded with other instructions that correspond to other source-level constructs. But there is no point in having a different instruction sequence than the optimal one to generate for conversions at -O0.

Unsigned 64-bit to double conversion: why this algorithm from g++

Tags:

floating-point

assembly

g++

x86-64

Matthew Daws

1 Answers

Pascal Cuoq

Recent Activity

Donate For Us

Unsigned 64-bit to double conversion: why this algorithm from g++

Tags:

floating-point

assembly

g++

x86-64

Matthew Daws

1 Answers

Pascal Cuoq

Related questions

Recent Activity

Donate For Us