According to Agner Fog's instruction tables, a single FP division is slower than a reciprocal op followed by a multiply op. (This seems to be common across the x86 architectures measured.)
Here is an excerpt from the table for the Piledriver architecture:
Instruction      Operands   Ops  Latency  Reciprocal throughput  Pipes  Unit
MULSS MULSD      x,x/m      1    5-6      0.5                    P01    fma
MULPS MULPD      x,x/m      1    5-6      0.5                    P01    fma
VMULPS VMULPD    y,y,y/m    2    5-6      1                      P01    fma
DIVSS DIVPS      x,x/m      1    9-24     5-10                   P01    fp
VDIVPS           y,y,y/m    2    9-24     9-20                   P01    fp
DIVSD DIVPD      x,x/m      1    9-27     5-10                   P01    fp
VDIVPD           y,y,y/m    2    9-27     9-18                   P01    fp
RCPSS/PS         x,x/m      1    5        1                      P01    fp
The Latency column (the fourth from the left) shows that the multiply ops take 5-6 cycles, the division ops take 9-24, and the reciprocal op takes 5. Since 24 > 6 + 5, I'm wondering why the two separate ops are faster than the one op that gives essentially the same result.
I suspect the answer to this question involves the measurement of error. Perhaps division is much more accurate than reciprocal plus multiply. If so, how does the error compare? Is there a linear relationship, for example: since division is nearly twice as slow as reciprocal + multiply, is it also twice as accurate?
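For concreteness, here is roughly what the two alternatives look like with SSE intrinsics (a sketch, not measured code; the RCPPS accuracy figure in the comment is Intel's documented bound, not something from Agner's tables):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One instruction: correctly-rounded divide (DIVPS). */
__m128 div_exact(__m128 num, __m128 den) {
    return _mm_div_ps(num, den);
}

/* Two instructions: reciprocal estimate (RCPPS) + multiply (MULPS).
   RCPPS is documented with relative error <= 1.5 * 2^-12,
   i.e. only about 12 correct bits. */
__m128 div_approx(__m128 num, __m128 den) {
    return _mm_mul_ps(num, _mm_rcp_ps(den));
}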
IIRC, the fast approximate reciprocal and sqrt instructions are basically a lookup in an internal hardware table, without the iterative refinement that makes accurate division / sqrt slow and hard to pipeline. That's how they achieve one-per-clock throughput.
Notice that divss throughput isn't much better than latency until very recent microarchitectures, and even Skylake's very impressive FP divide / sqrt unit isn't fully pipelined.
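The "iterative refinement" mentioned above is typically done in software as a Newton-Raphson step on top of the estimate. A sketch with SSE intrinsics (illustrative of the usual software trick, not of what the hardware divider literally does):

#include <xmmintrin.h>

/* One Newton-Raphson step on the RCPPS estimate:
   x1 = x0 * (2 - d * x0), which roughly doubles the number of
   correct bits (~12 -> ~23, close to full single precision). */
__m128 recip_refined(__m128 d) {
    __m128 x0  = _mm_rcp_ps(d);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(d, x0)));
}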
As for the rest of your question, the answers are the same as for rsqrt, so see this question: Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?
(Thanks Ross for digging up the link)