I noticed that calculating the integer part of square root of uint64_t
is much more complicated than of int64_t
. Please, does anybody have an explanation for this? Why is it seemingly much more difficult to deal with one extra bit?
The following:
int64_t sqrt_int(int64_t a) {
return sqrt(a);
}
compiles with clang 5.0 and -mfpmath=sse -msse3 -Wall -O3
to
sqrt_int(long): # @sqrt_int(long)
cvtsi2sd xmm0, rdi
sqrtsd xmm0, xmm0
cvttsd2si rax, xmm0
ret
But the following:
uint64_t sqrt_int(uint64_t a) {
return sqrt(a);
}
compiles to:
.LCPI0_0:
.long 1127219200 # 0x43300000
.long 1160773632 # 0x45300000
.long 0 # 0x0
.long 0 # 0x0
.LCPI0_1:
.quad 4841369599423283200 # double 4503599627370496
.quad 4985484787499139072 # double 1.9342813113834067E+25
.LCPI0_2:
.quad 4890909195324358656 # double 9.2233720368547758E+18
sqrt_int(unsigned long): # @sqrt_int(unsigned long)
movq xmm0, rdi
punpckldq xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
subpd xmm0, xmmword ptr [rip + .LCPI0_1]
haddpd xmm0, xmm0
sqrtsd xmm0, xmm0
movsd xmm1, qword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero
movapd xmm2, xmm0
subsd xmm2, xmm1
cvttsd2si rax, xmm2
movabs rcx, -9223372036854775808
xor rcx, rax
cvttsd2si rax, xmm0
ucomisd xmm0, xmm1
cmovae rax, rcx
ret
Int64 is used to represents 64-bit signed integers . UInt64 is used to represent 64-bit unsigned integers.
In 32-bit mode, the compiler (more precisely the <stdint. h> header) defines uint64_t as unsigned long long , because unsigned long isn't wide enough. In 64-bit mode, it defines uint64_t as unsigned long .
First off, you need to be clear that this code is converting 64-bit integers (signed or unsigned) to double precision floating point, taking the square root, and then casting the result back to a signed or unsigned integer.
The answer to your question is because Intel provided 64-bit signed integer to double precision floating-point conversion (and the opposite) in the instruction set you are compiling for, but did not do so for the unsigned case. They added the unsigned conversion instruction in AVX-512, but it does not exist prior to that. So for the signed case, the conversion to double precision and the conversion back are one instruction each. For the unsigned case, the compiler has to synthesize the conversion operation from available instructions.
You can get information on which instructions are available in which versions of SSE2/AVX/AVX-512, etc. here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
You can see discussion of the method used to synthesize the conversion here: Are there unsigned equivalents of the x87 FILD and SSE CVTSI2SD instructions?
In addition to Zalman's excellent answer: The result of the sqrt
is always less than INT64_MAX because the input of sqrt
is in the uint64_t
range. Therefore a single cvttsd2si
is sufficient to convert the double back to uint64_t
.
In other words: For all input values a
the function
uint64_t sqrt_int(uint64_t a) {
return sqrt(a);
}
has exactly the same behaviour as the modified code
uint64_t sqrt_int(uint64_t a) {
return (int64_t)sqrt(a);
}
The latter function compiles to
.LCPI0_0:
.long 1127219200 # 0x43300000
.long 1160773632 # 0x45300000
.long 0 # 0x0
.long 0 # 0x0
.LCPI0_1:
.quad 4841369599423283200 # double 4503599627370496
.quad 4985484787499139072 # double 1.9342813113834067E+25
sqrt_int(unsigned long): # @sqrt_int(unsigned long)
movq xmm0, rdi
punpckldq xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
subpd xmm0, xmmword ptr [rip + .LCPI0_1]
haddpd xmm0, xmm0
sqrtsd xmm0, xmm0
cvttsd2si rax, xmm0
ret
which is much less instruction than the original code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With