According to Agner Fog's instruction tables, a single FP division is slower than a reciprocal op followed by a multiply op. (This seems to be common across the x86 architectures measured.)
Here is an excerpt from the table for the Piledriver architecture:
Instruction      Operands   Ops  Latency  Reciprocal throughput  Pipes  Unit
MULSS MULSD      x,x/m      1    5-6      0.5                    P01    fma
MULPS MULPD      x,x/m      1    5-6      0.5                    P01    fma
VMULPS VMULPD    y,y,y/m    2    5-6      1                      P01    fma
DIVSS DIVPS      x,x/m      1    9-24     5-10                   P01    fp
VDIVPS           y,y,y/m    2    9-24     9-20                   P01    fp
DIVSD DIVPD      x,x/m      1    9-27     5-10                   P01    fp
VDIVPD           y,y,y/m    2    9-27     9-18                   P01    fp
RCPSS/PS         x,x/m      1    5        1                      P01    fp
The Latency column (the fourth from the left) shows that the multiply ops take 5-6 cycles, the division ops take 9-24, and the reciprocal op takes 5. Since 24 > 6 + 5, I'm wondering why the two separate ops are faster than the one op that gives essentially the same result.
I suspect the answer to this question involves the measurement of error. Perhaps division is much more accurate than reciprocal plus multiply. If so, how does the error compare? Is there a linear relationship, for example: since division is nearly twice as slow as reciprocal + multiply, is it also twice as accurate?
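For concreteness, here is roughly what the two alternatives look like with SSE intrinsics (a sketch, not measured code; the RCPPS accuracy figure in the comment is Intel's documented bound, not something from Agner's tables):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One instruction: correctly-rounded divide (DIVPS). */
__m128 div_exact(__m128 num, __m128 den) {
    return _mm_div_ps(num, den);
}

/* Two instructions: reciprocal estimate (RCPPS) + multiply (MULPS).
   RCPPS is documented with relative error <= 1.5 * 2^-12,
   i.e. only about 12 correct bits. */
__m128 div_approx(__m128 num, __m128 den) {
    return _mm_mul_ps(num, _mm_rcp_ps(den));
}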
IIRC, the fast approximate reciprocal and sqrt instructions are basically a lookup in an internal hardware table, without the iterative refinement that makes accurate division / sqrt slow and hard to pipeline. That's how they achieve one-per-clock throughput.
Notice that divss throughput isn't much better than latency until very recent microarchitectures, and even Skylake's very impressive FP divide / sqrt unit isn't fully pipelined.
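The "iterative refinement" mentioned above is typically done in software as a Newton-Raphson step on top of the estimate. A sketch with SSE intrinsics (illustrative of the usual software trick, not of what the hardware divider literally does):

#include <xmmintrin.h>

/* One Newton-Raphson step on the RCPPS estimate:
   x1 = x0 * (2 - d * x0), which roughly doubles the number of
   correct bits (~12 -> ~23, close to full single precision). */
__m128 recip_refined(__m128 d) {
    __m128 x0  = _mm_rcp_ps(d);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(d, x0)));
}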
As for the rest of your question, the answers are the same as for rsqrt, so see this question: Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?
(Thanks Ross for digging up the link)