I have been playing around with basic math function implementations in C++ for academic purposes. Today, I benchmarked the following code for square root:
inline float sqrt_new(float n)
{
__asm {
fld n
fsqrt
}
}
I was surprised to see that it is consistently faster than the standard sqrt function (it takes around 85% of the execution time of the standard function).
I don't quite get why and would love to better understand it. Below I show the full code I am using to profile (in Visual Studio 2015, compiling in Release mode and with all optimizations turned on):
#include <iostream>
#include <random>
#include <chrono>
#define M 1000000
float ranfloats[M];
using namespace std;
inline float sqrt_new(float n)
{
__asm {
fld n
fsqrt
}
}
int main()
{
default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f , 1.0f);
chrono::high_resolution_clock::time_point start1, start2;
chrono::high_resolution_clock::time_point end1, end2;
float sqrt1 = 0;
float sqrt2 = 0;
for (int i = 0; i<M; i++) ranfloats[i] = diceroll(randomGenerator);
start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);
end1 = std::chrono::high_resolution_clock::now();
start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);
end2 = std::chrono::high_resolution_clock::now();
auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
cout << "Time elapsed for SQRT1: " << time1 << " ms" << endl;
cout << "Time elapsed for SQRT2: " << time2 << " ms" << endl;
cout << "Average of Time for SQRT2 / Time for SQRT1: " << static_cast<double>(time2) / time1 << endl;
cout << "Equal to standard sqrt? " << (sqrt1 == sqrt2) << endl;
system("pause");
return 0;
}
EDIT: I am editing the question to include disassembly codes of both loops that calculate square roots as they came at Visual Studio 2015.
First, the disassembly for the loop for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]); is:
00091194 0F 5A C0 cvtps2pd xmm0,xmm0
00091197 E8 F2 18 00 00 call __libm_sse2_sqrt_precise (092A8Eh)
0009119C F2 0F 5A C0 cvtsd2ss xmm0,xmm0
000911A0 83 C6 04 add esi,4
000911A3 F3 0F 58 44 24 4C addss xmm0,dword ptr [esp+4Ch]
000911A9 F3 0F 11 44 24 4C movss dword ptr [esp+4Ch],xmm0
000911AF 81 FE 90 5C 46 00 cmp esi,offset __dyn_tls_dtor_callback (0465C90h)
000911B5 7C D9 jl main+190h (091190h)
Next, the disassembly for the loop for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]); is:
00091290 F3 0F 10 00 movss xmm0,dword ptr [eax]
00091294 F3 0F 11 44 24 6C movss dword ptr [esp+6Ch],xmm0
0009129A D9 44 24 6C fld dword ptr [esp+6Ch]
0009129E D9 FA fsqrt
000912A0 D9 5C 24 6C fstp dword ptr [esp+6Ch]
000912A4 F3 0F 10 44 24 6C movss xmm0,dword ptr [esp+6Ch]
000912AA 83 C0 04 add eax,4
000912AD F3 0F 58 44 24 54 addss xmm0,dword ptr [esp+54h]
000912B3 F3 0F 11 44 24 54 movss dword ptr [esp+54h],xmm0
Both your loops come out pretty horrible, with many bottlenecks other than the sqrt function call or the FSQRT instruction, and they run at least 2x slower than optimal scalar SQRTSS (single-precision) code could. That in turn is maybe 8x slower than what a decent SSE2 vectorized loop could achieve. Even without reordering any math operations, you could beat SQRTSS throughput.
Many of the reasons from https://gcc.gnu.org/wiki/DontUseInlineAsm apply to your example. The compiler won't be able to propagate constants through your function, and it won't know that the result is always non-negative (if it isn't NaN). It also won't be able to optimize it into an fabs() if you later square the number.
Also highly important, you defeat auto-vectorization with SSE2 SQRTPS (_mm_sqrt_ps()). A "no-error-checking" scalar sqrt() function using intrinsics also suffers from that problem. IDK if there's any way to get optimal results without /fp:fast, but I doubt it. (Other than writing a whole loop in assembly, or vectorizing the whole loop yourself with intrinsics).
It's pretty impressive that your Haswell CPU manages to run the function-call loop as fast as it does, although the inline-asm loop may not even be saturating FSQRT throughput either.
For some reason, your library function call is calling double sqrt(double), not the C++ overload float sqrt(float). This leads to a conversion to double and back to float. Probably you need to #include <cmath> to get the overloads, or you could call sqrtf(). gcc and clang on Linux call sqrtf() with your current code (without converting to double and back), but maybe their <random> header happens to include <cmath>, and MSVC's doesn't. Or maybe there's something else going on.
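As a minimal sketch of the overload point above (the function name sum_sqrt is made up for illustration, it is not from the question): once <cmath> is included, std::sqrt called with a float picks the float overload, so there is no round trip through double. The static_asserts only check the return types at compile time.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>

// With <cmath> included, std::sqrt is overloaded on the argument type.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "float argument should select float sqrt(float)");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "double argument should select double sqrt(double)");

// Hypothetical helper: sums sqrt of each element, staying in single precision.
float sum_sqrt(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::sqrt(data[i]);  // float overload, no conversion to double
    }
    return sum;
}
```

Whether the compiler then inlines this to SQRTSS or still calls a library function depends on the compiler and its floating-point flags.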
The library function-call loop keeps the sum in memory (instead of a register). Apparently the calling convention used by the 32-bit version of __libm_sse2_sqrt_precise doesn't preserve any XMM registers. The Windows x64 ABI does preserve XMM6-XMM15, but Wikipedia says this is new and the 32-bit ABI didn't do that. I assume if there were any call-preserved XMM registers, MSVC's optimizer would take advantage of them.
Anyway, besides the throughput bottleneck of calling sqrt on each independent scalar float, the loop-carried dependency on sqrt1 is a latency bottleneck that includes a store-forwarding round trip:
000911A3 F3 0F 58 44 24 4C addss xmm0,dword ptr [esp+4Ch]
000911A9 F3 0F 11 44 24 4C movss dword ptr [esp+4Ch],xmm0
Out-of-order execution lets the rest of the code for each iteration overlap, so you just bottleneck on throughput, but no matter how efficient the library sqrt function is, this latency bottleneck limits the loop to one iteration per 6 + 3 = 9 cycles. (Haswell ADDSS latency = 3, store-forwarding latency for XMM load/store = 6 cycles, 1 cycle more than store-forwarding for integer registers. See Agner Fog's instruction tables.)
SQRTSD has a throughput of one per 8-14 cycles, so the loop-carried dependency is not the limiting bottleneck on Haswell.
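The arithmetic above can be written down as a very rough model (an illustrative sketch only; the function name is hypothetical and real CPUs have more limits than these two): in steady state, one iteration costs about the larger of the loop-carried chain latency and the reciprocal throughput of the square-root unit.

```cpp
#include <algorithm>
#include <cassert>

// Rough steady-state cycles per loop iteration.
// carried_chain_latency: latency of the loop-carried dependency, in cycles
//   (e.g. ADDSS 3 + store-forwarding 6 = 9 on Haswell, per the text above).
// sqrt_reciprocal_throughput: cycles between independent sqrt results.
int cycles_per_iteration(int carried_chain_latency, int sqrt_reciprocal_throughput) {
    assert(carried_chain_latency >= 0 && sqrt_reciprocal_throughput >= 0);
    // Out-of-order execution overlaps the two, so the slower one wins.
    return std::max(carried_chain_latency, sqrt_reciprocal_throughput);
}
```

With the figures quoted here, cycles_per_iteration(3 + 6, 14) gives 14 (sqrt-bound at the slow end of the SQRTSD range), while cycles_per_iteration(3 + 6, 8) gives 9 (chain-bound at the fast end).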
The inline-asm version has a store/reload round trip for the sqrt result, but it's not part of the loop-carried dependency chain. MSVC inline-asm syntax makes it hard to avoid store-forwarding round trips to get data into / out of inline asm. But worse, you produce the result on the x87 stack, and the compiler wants to do SSE math in XMM registers.
And then MSVC shoots itself in the foot for no reason, keeping the sum in memory instead of in an XMM register. It looks inside inline-asm statements to see which registers they affect, so IDK why it doesn't see that your inline-asm statement doesn't clobber any XMM regs.
So MSVC does a much worse job than necessary here:
00091290 movss xmm0,dword ptr [eax] # load from the array
00091294 movss dword ptr [esp+6Ch],xmm0 # store to the stack
0009129A fld dword ptr [esp+6Ch] # x87 load from stack
0009129E fsqrt
000912A0 fstp dword ptr [esp+6Ch] # x87 store to the stack
000912A4 movss xmm0,dword ptr [esp+6Ch] # SSE load from the stack (of sqrt(array[i]))
000912AA add eax,4
000912AD addss xmm0,dword ptr [esp+54h] # SSE load+add of the sum
000912B3 movss dword ptr [esp+54h],xmm0 # SSE store of the sum
So it has the same loop-carried dependency chain (ADDSS + store-forwarding) as the function-call loop. Haswell FSQRT has one per 8-17 cycle throughput, so probably it's still the bottleneck. (All the stores/reloads involving the array value are independent for each iteration, and out-of-order execution can overlap many iterations to hide that latency chain. However, they will clog up the load/store execution units and sometimes delay the critical-path loads/stores by an extra cycle. This is called a resource conflict.)
Without /fp:fast, the sqrtf() library function has to set errno if the result is NaN. This is why it can't inline to just a SQRTSS.
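To see the errno behaviour described above for yourself, something like the following sketch can be used (the helper names are hypothetical). It assumes a C library that reports math errors through errno, which the standard macro math_errhandling tells you; for sqrt, the domain error comes from negative inputs.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>

// True when this C library reports math errors through errno at all.
bool libm_uses_errno() {
    return (math_errhandling & MATH_ERRNO) != 0;
}

// Returns true if calling sqrt on x left EDOM in errno.
// The volatile copies only keep the compiler from folding the call away.
bool sqrt_reports_edom(float x) {
    assert(!std::isnan(x));  // a NaN input is not a domain error, so exclude it
    volatile float input = x;
    errno = 0;
    volatile float result = std::sqrt(input);
    (void)result;
    return errno == EDOM;
}
```

On a library where libm_uses_errno() is true, sqrt_reports_edom(-1.0f) should come back true, and that side effect is what stops the compiler from replacing the call with a bare SQRTSS.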
If you did want to implement a no-checks scalar sqrt function yourself, you'd do it with Intel intrinsics syntax:
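// DON'T USE THIS, it defeats auto-vectorization
#include <immintrin.h>  // SSE intrinsics: _mm_set_ss, _mm_sqrt_ss, _mm_cvtss_f32

static inline
float sqrt_scalar(float x) {
    __m128 xvec = _mm_set_ss(x);               // x in the low lane, upper lanes zeroed
    return _mm_cvtss_f32(_mm_sqrt_ss(xvec));   // SQRTSS on the low lane, extracted as a float
}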
This compiles to a near-optimal scalar loop with gcc and clang (without -ffast-math). See it on the Godbolt compiler explorer:
# gcc6.2 -O3 for the sqrt_new loop using _mm_sqrt_ss. good scalar code, but don't optimize further.
.L10:
movss xmm0, DWORD PTR [r12]
add r12, 4
sqrtss xmm0, xmm0
addss xmm1, xmm0
cmp r12, rbx
jne .L10
This loop should bottleneck only on SQRTSS throughput (one per 7 clocks on Haswell, notably faster than SQRTSD or FSQRT), and with no resource conflicts. However, it's still garbage compared to what you could do even without re-ordering the FP adds (since FP add/mul aren't truly associative): a smart compiler (or programmer using intrinsics) would use SQRTPS to get 4 results with the same throughput as 1 result from SQRTSS. Unpack the vector of SQRT results to 4 scalars, and then you can keep exactly the same order of operations with identical rounding of intermediate results. I'm disappointed that clang and gcc didn't do this.
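A hand-written sketch of that idea with SSE intrinsics might look like this (sum_sqrt_sqrtps is a made-up name, and this is one possible way to do it, not what any compiler emits): compute four square roots with one SQRTPS, store them to a small array, and add them to the sum one at a time in the original order so the rounding of the running sum is unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Hypothetical helper: sum of sqrt(data[i]) with the same add order as a scalar loop.
float sum_sqrt_sqrtps(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float sum = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 roots = _mm_sqrt_ps(_mm_loadu_ps(data + i));  // 4 results per SQRTPS
        float r[4];
        _mm_storeu_ps(r, roots);                             // unpack to 4 scalars
        sum += r[0];                                         // same order as the scalar loop
        sum += r[1];
        sum += r[2];
        sum += r[3];
    }
    for (; i < n; ++i) {                                     // leftover elements, one at a time
        sum += _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(data[i])));
    }
    return sum;
}
```

The adds are still a serial dependency chain; only the square roots are vectorized, which is the part that was the throughput bottleneck.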
However, gcc and clang do manage to actually avoid calling a library function. clang3.9 (with just -O3) uses SQRTSS without even checking for NaN. I assume that's legal, and not a compiler bug. Maybe it sees that the code doesn't use errno?
gcc6.2 on the other hand speculatively inlines sqrtf(), with a SQRTSS and a check on the input to see if it needs to call the library function.
# gcc6.2's sqrt() loop, without -ffast-math.
# speculative inlining of SQRTSS with a check + fallback
# spills/reloads a lot of stuff to memory even when it skips the call :(
# xmm1 = 0.0 (gcc -fverbose-asm says it's holding sqrt2, which is zero-initialized, so I guess gcc decides to reuse that zero)
.L9:
movss xmm0, DWORD PTR [rbx]
sqrtss xmm5, xmm0
ucomiss xmm1, xmm0 # compare input against 0.0
movss DWORD PTR [rsp+8], xmm5
jbe .L8 # if(0.0 <= SQRTSS input || unordered(0.0, input)) { skip the function call; }
movss DWORD PTR [rsp+12], xmm1 # silly gcc, this store isn't needed. ucomiss doesn't modify xmm1
call sqrtf # called for negative inputs, but not for NaN.
movss xmm1, DWORD PTR [rsp+12]
.L8:
movss xmm4, DWORD PTR [rsp+4] # silly gcc always stores/reloads both, instead of putting the stores/reloads inside the block that the jbe skips
addss xmm4, DWORD PTR [rsp+8]
add rbx, 4
movss DWORD PTR [rsp+4], xmm4
cmp rbp, rbx
jne .L9
gcc unfortunately shoots itself in the foot here, the same way MSVC does with inline-asm: there's a store-forwarding round trip as a loop-carried dependency. All the spills/reloads could be inside the block skipped by the JBE. Maybe gcc thinks negative inputs will be common.
Even worse, if you do use /fp:fast or -ffast-math, even a clever compiler like clang doesn't manage to rewrite your _mm_sqrt_ss into a SQRTPS. Clang is normally pretty good at not just mapping intrinsics to instructions 1:1, and will come up with more optimal shuffles and blends if you miss an opportunity to combine things.
So with fast FP math enabled, using _mm_sqrt_ss is a big loss. clang compiles the sqrt() library function call version into RSQRTPS + a Newton-Raphson iteration.
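For reference, the RSQRTPS + Newton-Raphson pattern can be sketched by hand like this (an illustration under assumptions, not clang's exact output; both function names are hypothetical). It assumes every input is positive and finite, since 0 or infinity would turn into NaN in this simple form, and the result is approximate rather than correctly rounded.

```cpp
#include <cassert>
#include <immintrin.h>

// Approximate sqrt(x) for 4 floats as x * rsqrt(x), with one Newton-Raphson step.
// Assumption: every lane of x is > 0 and finite.
__m128 sqrt_approx_ps(__m128 x) {
    __m128 y = _mm_rsqrt_ps(x);                        // rough estimate of 1/sqrt(x)
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    __m128 yy = _mm_mul_ps(y, y);
    // Newton-Raphson refinement: y = y * (1.5 - 0.5 * x * y * y)
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_x, yy)));
    return _mm_mul_ps(x, y);                           // sqrt(x) = x * (1/sqrt(x))
}

// Convenience wrapper over plain arrays of exactly 4 floats.
void sqrt_approx4(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
    _mm_storeu_ps(out, sqrt_approx_ps(_mm_loadu_ps(in)));
}
```

This trades exactness for throughput, which is why a compiler only does it when fast FP math is enabled.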
Also note that your microbenchmark code isn't sensitive to the latency of your sqrt_new() implementation, only the throughput. Latency often matters in real FP code, not just throughput. But in other cases, like doing the same thing independently to many array elements, latency doesn't matter, because out-of-order execution can hide it well enough by having instructions in flight from many loop iterations.
As I mentioned earlier, latency from the extra store/reload round trip your data takes on its way in/out of MSVC-style inline-asm is a serious problem here. When MSVC inlines the function, the fld n doesn't come directly from the array.
BTW, Skylake has SQRTPS/SS throughput of one per 3 cycles, but still 12 cycle latency. SQRTPD/SD throughput = one per 4-6 cycles, latency = 15-16 cycles. So FP square root is more pipelined on Skylake than on Haswell. This magnifies the difference between benchmarking FP sqrt latency vs. throughput.
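To make the latency vs. throughput distinction concrete, here is a small sketch of the two loop shapes a benchmark could time (function names are hypothetical, and how close you get to the quoted cycle counts depends on what the compiler emits around the sqrt). In the first, each square root needs the previous result, so the loop runs at sqrt latency; in the second, every square root is independent, so out-of-order execution can overlap them and the loop runs at sqrt throughput.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Latency-bound shape: each sqrt consumes the previous sqrt's result.
float sqrt_dependent_chain(float x, std::size_t steps) {
    for (std::size_t i = 0; i < steps; ++i) {
        x = std::sqrt(x);  // cannot start until the previous iteration finishes
    }
    return x;
}

// Throughput-bound shape: one independent sqrt per array element.
void sqrt_independent(const float* in, float* out, std::size_t n) {
    assert((in != nullptr && out != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sqrt(in[i]);  // no dependency between iterations
    }
}
```

The benchmark in the question has the second shape for the sqrt itself, which is why it only measures throughput.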
"compiling in Release mode and with all optimizations turned on"
They are not all turned on, you missed one. In the IDE it is Project > Properties > C/C++ > Code Generation > Floating Point Model. You left it at its default setting, /fp:precise. That has a very visible side-effect on the generated machine code:
00091197 E8 F2 18 00 00 call __libm_sse2_sqrt_precise (092A8Eh)
Perhaps it is intuitive enough that calling a helper function in the CRT is always slower than an inline instruction like FSQRT.
There is a lot to say about the exact semantics of /fp, the MSDN article about it is not very good. It is also hard to reverse-engineer, Microsoft purchased the code from Intel and could not obtain a source license that allowed them to re-publish the assembly code. Its original goal was certainly to deal with the horrid floating point consistency problems caused by Intel's 8087 FPU design. That is not so relevant today anymore, all mainstream C and C++ compilers now emit SSE2 code. MSVC++ does so since VS2012. These Intel library functions now mainly ensure that floating point operations still produce results that are consistent with older versions of the compiler.
__libm_sse2_sqrt_precise() does rather a lot. At the considerable risk of trying to document an undocumented function, I think I see it:
_matherr() function.
None of this actually has anything to do with precision :) Seeing this execute at 85% perf is rather a good result, aided however by FSQRT being substantially slower than SQRTSD. The latter got a lot more silicon love in modern processors.
If you care about fast floating point operations then change the setting to /fp:fast. Which produces:
00D91310 sqrtsd xmm0,xmm0
An inline instruction instead of a library call. In other words, skips the first 3 bullets in the previous list. Also beats FSQRT handily.