I want to perform division on an FPU in C (using integer values):
float foo;
uint32_t *ptr1, *ptr2;
foo = (float)*(ptr1) / (float)*(ptr2);
And in NASM syntax (disassembled from an object file compiled with GCC), it has the following representation:
mov rax, QWORD [ptr1]
mov eax, DWORD [rax]
mov eax, eax
test rax, rax
js ?_001
pxor xmm0, xmm0
cvtsi2ss xmm0, rax
jmp ?_002
?_001:
mov rdx, rax
shr rdx, 1
and eax, 01H
or rdx, rax
pxor xmm0, xmm0
cvtsi2ss xmm0, rdx
addss xmm0, xmm0
?_002:
mov rax, QWORD [ptr2]
; ... for ptr2 pattern repeats
What does this "black magic" under ?_001 mean? Isn't cvtsi2ss alone enough to convert an integer to a float?
You must be looking at unoptimized code. That is a waste of time. When the optimizer is disabled, compilers generate a bunch of nonsense code for various reasons—to achieve a faster compilation speed, to make it easier to set breakpoints on lines of source, to make it easier to catch bugs, etc.
When you generate optimized code on a compiler targeting x86-64, all of this noise goes away, the code becomes much more efficient, and by consequence, much easier to interpret/understand.
Here's a function that performs the operation you want. I've written it as a function so that I can pass the inputs as opaque parameters and the compiler can't optimize it away.
float DivideAsFloat(uint32_t *ptr1, uint32_t *ptr2)
{
return (float)(*ptr1) / (float)(*ptr2);
}
Here's the object code that all versions of GCC (back to 4.9.0) generate for this function:
DivideAsFloat(unsigned int*, unsigned int*):
mov eax, DWORD PTR [rdi] ; retrieve value of 'ptr1' parameter
pxor xmm0, xmm0 ; zero-out xmm0 register
pxor xmm1, xmm1 ; zero-out xmm1 register
cvtsi2ssq xmm0, rax ; convert *ptr1 into a floating-point value in XMM0
mov eax, DWORD PTR [rsi] ; retrieve value of 'ptr2' parameter
cvtsi2ssq xmm1, rax ; convert *ptr2 into a floating-point value in XMM1
divss xmm0, xmm1 ; divide the two floating-point values
ret
This is almost exactly what you would expect to see. The only "black magic" here is the PXOR
instructions. Why is the compiler bothering to pre-zero the XMM registers before executing the CVTSI2SS
instruction that's just going to clobber them anyway? Well, because CVTSI2SS
only partially clobbers its destination register. Specifically, it clobbers only the lower bits, leaving the upper bits untouched. This results in a false dependency on the upper bits, which results in execution stalls. This dependency can be broken by pre-zeroing the register, thus preventing the possibility of stalls and speeding up execution. The PXOR
instruction is a fast, efficient way to clear a register. (I talked about this exact same phenomenon recently over here—see last paragraph.)
In fact, older versions of GCC (prior to 4.9.0) did not perform this optimization and thus generated code that did not include the PXOR
instructions. It looks more efficient, but it actually runs slower.
DivideAsFloat(unsigned int*, unsigned int*):
mov eax, DWORD PTR [rdi] ; retrieve value of 'ptr1' parameter
cvtsi2ssq xmm0, rax ; convert *ptr1 into a floating-point value in XMM0
mov eax, DWORD PTR [rsi] ; retrieve value of 'ptr2' parameter
cvtsi2ssq xmm1, rax ; convert *ptr2 into a floating-point value in XMM1
divss xmm0, xmm1 ; divide the two floating-point values
ret
Clang 3.9 emits the same code as these older versions of GCC. It doesn't know about the optimization, either. MSVC does know about it (since VS 2010), and so do modern versions of ICC (verified on ICC 16 and later; missing in ICC 13).
However, that's not to say that Anty's answer (and Mystical's comment) is entirely incorrect. CVTSI2SS
is indeed designed for converting a signed integer to a scalar single-precision float, not an unsigned integer like you have here. So what gives? Well, a 64-bit processor has 64-bit wide registers available, so the unsigned 32-bit input values can be stored as signed 64-bit intermediate values, which allows CVTSI2SS
to be used after all.
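In C terms, the widening trick amounts to converting through a signed 64-bit intermediate. Here's a minimal sketch (the helper name is mine, not from the original code); a uint32_t always fits in an int64_t, so the signed conversion is exact:

```c
#include <stdint.h>

/* Hypothetical helper: a uint32_t always fits in a signed 64-bit
   integer, so the signed conversion instruction (CVTSI2SS with a
   64-bit source register) can be used without any fix-up code. */
static float u32_to_float_via_i64(uint32_t v)
{
    int64_t wide = (int64_t)v;  /* zero-extend: the value is unchanged */
    return (float)wide;         /* compiles to a single cvtsi2ssq */
}
```

Note that 0xFFFFFFFF converts to 4294967296.0f, not 4294967295.0f: the 24-bit float mantissa cannot hold 2^32 - 1 exactly, so it rounds to the nearest representable value.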
Compilers do that when optimization is enabled because it results in more efficient code. If, on the other hand, you were targeting 32-bit x86 and didn't have 64-bit registers available, the compiler would have to deal with the signed vs. unsigned problem. Here's how GCC 6.3 deals with it:
DivideAsFloat(unsigned int*, unsigned int*):
sub esp, 4
pxor xmm0, xmm0
mov eax, DWORD PTR [esp+8]
pxor xmm1, xmm1
movss xmm3, 1199570944
pxor xmm2, xmm2
mov eax, DWORD PTR [eax]
movzx edx, ax
shr eax, 16
cvtsi2ss xmm0, eax
mov eax, DWORD PTR [esp+12]
cvtsi2ss xmm1, edx
mov eax, DWORD PTR [eax]
movzx edx, ax
shr eax, 16
cvtsi2ss xmm2, edx
mulss xmm0, xmm3
addss xmm0, xmm1
pxor xmm1, xmm1
cvtsi2ss xmm1, eax
mulss xmm1, xmm3
addss xmm1, xmm2
divss xmm0, xmm1
movss DWORD PTR [esp], xmm0
fld DWORD PTR [esp]
add esp, 4
ret
This is a bit difficult to follow because of how the optimizer rearranged and interleaved instructions. Here, I've "unoptimized" it, reordering the instructions and breaking them up into more logical groups, in hopes of making it easier to follow the flow of execution. (The only instruction I removed was a dependency-breaking PXOR
—the rest of the code is the same, just rearranged.)
DivideAsFloat(unsigned int*, unsigned int*):
;;; Initialization ;;;
sub esp, 4 ; reserve 4 bytes on the stack
pxor xmm0, xmm0 ; zero-out XMM0
pxor xmm1, xmm1 ; zero-out XMM1
pxor xmm2, xmm2 ; zero-out XMM2
movss xmm3, 1199570944 ; load the constant 65536.0 (2^16, bit pattern 0x47800000) into XMM3
;;; Deal with the first value ('ptr1') ;;;
mov eax, DWORD PTR [esp+8] ; get the pointer specified in 'ptr1'
mov eax, DWORD PTR [eax] ; dereference the pointer specified by 'ptr1'
movzx edx, ax ; put the lower 16 bits of *ptr1 in EDX
shr eax, 16 ; move the upper 16 bits of *ptr1 down to the lower 16 bits in EAX
cvtsi2ss xmm0, eax ; convert the upper 16 bits of *ptr1 to a float
cvtsi2ss xmm1, edx ; convert the lower 16 bits of *ptr1 (now in EDX) to a float
mulss xmm0, xmm3 ; multiply FP-representation of upper 16 bits of *ptr1 by magic number
addss xmm0, xmm1 ; add the result to the FP-representation of *ptr1's lower 16 bits
;;; Deal with the second value ('ptr2') ;;;
mov eax, DWORD PTR [esp+12] ; get the pointer specified in 'ptr2'
mov eax, DWORD PTR [eax] ; dereference the pointer specified by 'ptr2'
movzx edx, ax ; put the lower 16 bits of *ptr2 in EDX
shr eax, 16 ; move the upper 16 bits of *ptr2 down to the lower 16 bits in EAX
cvtsi2ss xmm2, edx ; convert the lower 16 bits of *ptr2 (now in EDX) to a float
cvtsi2ss xmm1, eax ; convert the upper 16 bits of *ptr2 to a float
mulss xmm1, xmm3 ; multiply FP-representation of upper 16 bits of *ptr2 by magic number
addss xmm1, xmm2 ; add the result to the FP-representation of *ptr2's lower 16 bits
;;; Do the division, and return the result ;;;
divss xmm0, xmm1 ; FINALLY, divide the FP-representation of *ptr1 by *ptr2
movss DWORD PTR [esp], xmm0 ; store this result onto the stack, in the memory we reserved
fld DWORD PTR [esp] ; load this result onto the top of the x87 FPU
; (the 32-bit calling convention requires floating-point values be returned this way)
add esp, 4 ; clean up the space we allocated on the stack
ret
Notice that the strategy here is to break each of the unsigned 32-bit integer values up into its two 16-bit halves. The upper half is converted to a floating-point representation and multiplied by a magic number (to compensate for the signed-ness). Then, the lower half is converted to a floating-point representation, and these two floating-point representations (each 16-bit half of the original 32-bit value) are added together. This is done twice—once for each 32-bit input value (see the two "groups" of instructions). Then, finally, the resulting two floating-point representations are divided, and the result is returned.
The logic is similar to what the unoptimized code does, but…well, more optimal. In particular, redundant instructions are removed and the algorithm is generalized so that branching on the signed-ness is unnecessary. This speeds things up because mispredicted branches are slow.
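The two-halves strategy is easier to see in C. Here's a minimal sketch (the function name is mine): each 16-bit half fits comfortably in a signed 32-bit integer, so it can be converted directly, and the halves are recombined using the "magic" constant 65536.0f (2^16):

```c
#include <stdint.h>

/* Sketch of GCC's 32-bit strategy, assuming only signed conversions
   are available: split the value into 16-bit halves, convert each
   half (both are small enough to be non-negative as int32_t), then
   recombine as hi * 2^16 + lo. The multiply is exact because hi has
   at most 16 significant bits, so only the final add rounds. */
static float u32_to_float_by_halves(uint32_t v)
{
    uint32_t hi = v >> 16;      /* upper 16 bits */
    uint32_t lo = v & 0xFFFFu;  /* lower 16 bits */
    return (float)(int32_t)hi * 65536.0f + (float)(int32_t)lo;
}
```

Because only the final addition rounds, the result matches a direct `(float)v` conversion exactly.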
Note that Clang uses a slightly different strategy and is able to generate even more optimal code here than GCC does:
DivideAsFloat(unsigned int*, unsigned int*):
push eax ; reserve 4 bytes on the stack
mov eax, DWORD PTR [esp+12] ; get the pointer specified in 'ptr2'
mov ecx, DWORD PTR [esp+8] ; get the pointer specified in 'ptr1'
movsd xmm1, QWORD PTR 4841369599423283200 ; load the constant 0x4330000000000000
; (the double-precision value 2^52) into XMM1
movd xmm0, DWORD PTR [ecx] ; dereference the pointer specified by 'ptr1',
; and load the bits directly into XMM0
movd xmm2, DWORD PTR [eax] ; dereference the pointer specified by 'ptr2'
; and load the bits directly into XMM2
orpd xmm0, xmm1 ; bitwise-OR *ptr1's raw bits with the magic number
orpd xmm2, xmm1 ; bitwise-OR *ptr2's raw bits with the magic number
subsd xmm0, xmm1 ; subtract the magic number from the result of the OR
subsd xmm2, xmm1 ; subtract the magic number from the result of the OR
cvtsd2ss xmm0, xmm0 ; convert *ptr1 from double-precision to single-precision in place
xorps xmm1, xmm1 ; zero register to break dependencies
cvtsd2ss xmm1, xmm2 ; convert *ptr2 from double-precision to single-precision, putting the result in XMM1
divss xmm0, xmm1 ; FINALLY, do the division on the single-precision FP values
movss DWORD PTR [esp], xmm0 ; store this result onto the stack, in the memory we reserved
fld DWORD PTR [esp] ; load this result onto the top of the x87 FPU
; (the 32-bit calling convention requires floating-point values be returned this way)
pop eax ; clean up the space we allocated on the stack
ret
It doesn't even use the CVTSI2SS
instruction! Instead, it loads the integer bits and manipulates them with some magic bit-twiddling so it can treat them as double-precision floating-point values. Some more bit-twiddling later, it uses CVTSD2SS
to convert each of these double-precision floating-point values into a single-precision floating-point value. Finally, it divides the two single-precision values and arranges to return the result.
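Clang's bit-twiddling can be sketched in C as well (the helper name is mine). The constant 0x4330000000000000 is the double 2^52; OR-ing a 32-bit integer into the low mantissa bits produces exactly 2^52 + v, and subtracting 2^52 leaves exactly (double)v, with no integer-conversion instruction at all:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of Clang's trick: build the double 2^52 + v by stuffing the
   integer's bits into the mantissa of 2^52, then subtract 2^52. Both
   steps are exact, so only the final narrowing to float rounds.
   Assumes integer and floating-point types share the same byte order,
   as they do on all common platforms. */
static float u32_to_float_bit_trick(uint32_t v)
{
    uint64_t bits = 0x4330000000000000ull | (uint64_t)v; /* 2^52 + v, exactly */
    double d;
    memcpy(&d, &bits, sizeof d);  /* reinterpret the bits as a double */
    d -= 4503599627370496.0;      /* subtract 2^52, leaving (double)v */
    return (float)d;              /* one rounding: cvtsd2ss */
}
```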
So when targeting 32-bit, compilers do have to deal with the difference between signed and unsigned integers, but they do so in different ways, using different strategies—some probably more optimal than others. And this is why looking at optimized code is far more enlightening, aside from the fact that it is what will actually be executing on your client's machine.
In general, cvtsi2ss does the trick: it converts a scalar integer (other sources call it "doubleword integer to scalar single", but my naming is consistent with the other vector instructions) to a scalar single-precision float. However, it expects a signed integer.
So this code
mov rdx, rax
shr rdx, 1
and eax, 01H
or rdx, rax
pxor xmm0, xmm0
cvtsi2ss xmm0, rdx
addss xmm0, xmm0
helps convert a value that would appear negative when interpreted as a signed integer (note the js jump: this code runs only if the sign bit is set, and is skipped otherwise). Strictly speaking, the sign bit of RAX can never be set here, because the 32-bit load was zero-extended to 64 bits; the compiler is simply emitting its generic unsigned-64-bit-to-float sequence.
So the "magic" code does:
mov rdx, rax ; copy the value into RDX
shr rdx, 1 ; halve it (logical shift, not arithmetic, because the value is unsigned)
and eax, 01H ; save the least significant bit (the one lost by the shift)
or rdx, rax ; OR that bit back in so the conversion still rounds correctly
pxor xmm0, xmm0 ; zero XMM0 to break the false dependency
cvtsi2ss xmm0, rdx ; convert the halved (now non-negative) value to a float
addss xmm0, xmm0 ; add it to itself = multiply by 2, restoring the magnitude
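The same sequence can be sketched in C (the function name is hypothetical); this is the generic unsigned-64-bit-to-float conversion the compiler is emitting:

```c
#include <stdint.h>

/* Sketch of the branch under ?_001: when the value would look negative
   as a signed 64-bit integer, halve it first (OR-ing the lost low bit
   back in as a "sticky" bit so the rounding is still correct), convert
   the now-positive value, and double the result. */
static float u64_to_float(uint64_t v)
{
    if ((int64_t)v >= 0)                 /* sign bit clear: convert directly */
        return (float)(int64_t)v;
    uint64_t half = (v >> 1) | (v & 1);  /* halve; keep the LSB as a sticky bit */
    return (float)(int64_t)half * 2.0f;  /* convert, then multiply back by 2 */
}
```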
I'm not sure which compiler and which options you used; with optimization enabled, GCC simply does:
cvtsi2ssq xmm0, rbx
cvtsi2ssq xmm1, rax
divss xmm0, xmm1
I hope this helps.