Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

Well on the Intel intrinsic guide it is stated that the instruction called "sqrtsd" has a latency of 18 cycles.

I tested it with my own program and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x) number then the latency is only 13. Why is that?

One theory I had is that since 13 is the latency of "sqrtss" which is the same as "sqrtsd" but done on 32bits floating points then maybe the processor was smart enough to understand taht 256 can fit in 32 bit and hence use that version while 0.15 needs the full 64 bit since it isn't representable in a finite way.

I am doing it using inline assembly, here is the relveant part compiled with gcc -O3 and -fno-tree-vectorize.

static double sqrtsd (double x) {
    double r;
    __asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
    return r;
}
like image 562
Tommy95 Avatar asked Mar 12 '20 20:03

Tommy95


People also ask

How do you measure instruction latency?

To measure latency yourself, you make the output of each instruction an input for the next. This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles.

What is instruction throughput?

Instruction Throughput: Is usually an average of the number of instructions completed each clock cycle. IPC (Instructions Per Clock): Is how many instructions are being completing each clock cycle.

What is instruction table?

Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed below.

What is reciprocal throughput?

Reciprocal throughput is simply the reciprocal of the maximum throughput of a particular instruction. Throughput is measured in instructions/cycle, so reciprocal throughput is cycles/instruction.


1 Answers

SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma). Everything else is pretty much fixed so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result was ready some cycle, it just knows it will be.

As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)

div/sqrt are hard to implement; even in hardware the best we can do is an iterative refinement process. Refining more bits at once (radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units. Older CPUs had throughput=latency for FP div/sqrt; that execution unit is harder to pipeline.)


https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/

InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).

Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.

What makes some cases fast?

Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2exp, and a sign bit.

It seems that there may only be 2 speeds on modern Intel (since Haswell at least) (See discussion with @harold in comments.) e.g. even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have trivial mantissa=0x0 representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.

On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.

(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0. vs. 18 cycle latency for the slow case.)

x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.

With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinite non-zero bits in any base. This apparently makes it slow on Skylake.

Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.

On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.


You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.

Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss leave the upper 64 / 32 bits (respectively) of the destination unmodified, thus the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput with the compiler picking a different register for the output so it could just re-read the same input in a loop. The inline asm you added to your question is totally broken and won't even compile, but perhaps your real code used "x" (xmm register) input and output constraints instead of "i" (immediate)?

This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.

This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd dst, src,src at least for register sources, and similarly vcvtsi2sd dst, cold_reg, eax for the similarly near-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)


On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.

Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right. https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.

like image 78
Peter Cordes Avatar answered Oct 19 '22 18:10

Peter Cordes