Is there any (non-microoptimization) performance gain by coding
float f1 = 200f / 2
in comparision to
float f2 = 200f * 0.5
A professor of mine told me a few years ago that floating point divisions were slower than floating point multiplications without elaborating the why.
Does this statement hold for modern PC architecture?
Update1
In respect to a comment, please do also consider this case:
float f1; float f2 = 2 float f3 = 3; for( i =0 ; i < 1e8; i++) { f1 = (i * f2 + i / f3) * 0.5; //or divide by 2.0f, respectively }
Update 2 Quoting from the comments:
[I want] to know what are the algorithmic / architectural requirements that cause > division to be vastly more complicated in hardware than multiplication
Multiplication is faster than division.
Yes, indeed, there is a difference. A loop with a million multiplies by 0.5 took 0.11 seconds and a loop with a million divides by 2 took 1.6 seconds. So it's true for the RPG (and probably for the IBM i) that multiplying is quicker than dividing.
A floating point division where a number divides another number can be expressed as. Thus it can be said that in a floating point division, mantissas are divided and exponents are subtracted. The major steps for a floating point division are. Extract the sign of the result from the two sign bits.
And the results (see the comments) are similar to that of Intel: division is about 3-6 times slower than multiplication.
Yes, many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).
If you look at this answer you will see that division can exceed 24 cycles.
Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.
Be very careful with division, and avoid it when possible. For example, hoist float inverse = 1.0f / divisor;
out of a loop and multiply by inverse
inside the loop. (If the rounding error in inverse
is acceptable)
Usually 1.0/x
will not be exactly-representable as a float
or double
. It will be exact when x
is a power of 2. This lets compilers optimize x / 2.0f
to x * 0.5f
without any change in the result.
To let the compiler do this optimization for you even when the result won't be exact (or with a runtime-variable divisor), you need options like gcc -O3 -ffast-math
. Specifically, -freciprocal-math
(enabled by -funsafe-math-optimizations
enabled by -ffast-math
) lets the compiler replace x / y
with x * (1/y)
when that's useful. Other compilers have similar options, and ICC may enable some "unsafe" optimization by default (I think it does, but I forget).
-ffast-math
is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is not associative. Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?
Also note that C++ compilers can fold +
and *
into an FMA in some cases (when compiling for a target that supports it, like -march=haswell
), but they can't do that with /
.
Division has worse latency than multiplication or addition (or FMA) by a factor of 2 to 4 on modern x86 CPUs, and worse throughput by a factor of 6 to 401 (for a tight loop doing only division instead of only multiplication).
The divide / sqrt unit is not fully pipelined, for reasons explained in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel CPUs have an arith.divider_active
hardware performance counter to help you find code that bottlenecks on divider throughput instead of the usual front-end or execution port bottlenecks. (Or more often, memory bottlenecks or long latency chains limiting instruction-level parallelism causing instruction throughput to be less than ~4 per clock).
However, FP division and sqrt on Intel and AMD CPUs (other than KNL) is implemented as a single uop, so it doesn't necessarily have a big throughput impact on surrounding code. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the divide.
(Integer division is microcoded as multiple uops on Intel, so it always has more impact on surrounding code that integer multiply. There's less demand for high-performance integer division, so the hardware support isn't as fancy. Related: microcoded instructions like idiv
can cause alignment-sensitive front-end bottlenecks.)
So for example, this will be really bad:
for () a[i] = b[i] / scale; // division throughput bottleneck // Instead, use this: float inv = 1.0 / scale; for () a[i] = b[i] * inv; // multiply (or store) throughput bottleneck
All you're doing in the loop is load/divide/store, and they're independent so it's throughput that matters, not latency.
A reduction like accumulator /= b[i]
would bottleneck on divide or multiply latency, rather than throughput. But with multiple accumulators that you divide or multiply at the end, you can hide the latency and still saturate the throughput. Note that sum += a[i] / b[i]
bottlenecks on add
latency or div
throughput, but not div
latency because the division isn't on the critical path (the loop-carried dependency chain).
But in something like this (approximating a function like log(x)
with a ratio of two polynomials), the divide can be pretty cheap:
for () { // (not shown: extracting the exponent / mantissa) float p = polynomial(b[i], 1.23, -4.56, ...); // FMA chain for a polynomial float q = polynomial(b[i], 3.21, -6.54, ...); a[i] = p/q; }
For log()
over the range of the mantissa, a ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating 2 in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dep chain, making things a LOT easier for out-of-order execution.
In this case, we don't bottleneck on divide latency because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.
We don't bottleneck on divide throughput as long as our polynomials are big enough that we only have one divide for every 10 FMA instructions or so. (And in a real log()
use case, there's be a bunch of work extracting exponent / mantissa and combining things back together again, so there's even more work to do between divides.)
rcpps
x86 has an approximate-reciprocal instruction (rcpps
), which only gives you 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)
You can use this to do x / y = x * approx_recip(y)
without using an actual divide instruction. (rcpps
itsef is fairly fast; usually a bit slower than multiplication. It uses a table lookup from a table internal to the CPU. The divider hardware may use the same table for a starting point.)
For most purposes, x * rcpps(y)
is too inaccurate, and a Newton-Raphson iteration to double the precision is required. But that costs you 2 multiplies and 2 FMAs, and has latency about as high as an actual divide instruction. If all you're doing is division, then it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, maybe by doing the division as part of another loop that does other work.)
But if you're using division as part of a more complex function, the rcpps
itself + the extra mul + FMA usually makes it faster to just divide with a divps
instruction, except on CPUs with very low divps
throughput.
(For example Knight's Landing, see below. KNL supports AVX512ER, so for float
vectors the VRCP28PS
result is already accurate enough to just multiply without a Newton-Raphson iteration. float
mantissa size is only 24 bits.)
Unlike every other ALU operation, division latency/throughput is data-dependent on some CPUs. Again, this is because it's so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids write-back conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3 cycle instruction and then two 1-cycle operations).
Generally, the fastest cases are when the divisor is a "round" number like 2.0
or 0.5
(i.e. the base2 float
representation has lots of trailing zeros in the mantissa).
float
latency (cycles) / throughput (cycles per instruction, running just that back to back with independent inputs):
scalar & 128b vector 256b AVX vector divss | mulss divps xmm | mulps vdivps ymm | vmulps ymm Nehalem 7-14 / 7-14 | 5 / 1 (No AVX) Sandybridge 10-14 / 10-14 | 5 / 1 21-29 / 20-28 (3 uops) | 5 / 1 Haswell 10-13 / 7 | 5 / 0.5 18-21 / 14 (3 uops) | 5 / 0.5 Skylake 11 / 3 | 4 / 0.5 11 / 5 (1 uop) | 4 / 0.5 Piledriver 9-24 / 5-10 | 5-6 / 0.5 9-24 / 9-20 (2 uops) | 5-6 / 1 (2 uops) Ryzen 10 / 3 | 3 / 0.5 10 / 6 (2 uops) | 3 / 1 (2 uops) Low-power CPUs: Jaguar(scalar) 14 / 14 | 2 / 1 Jaguar 19 / 19 | 2 / 1 38 / 38 (2 uops) | 2 / 2 (2 uops) Silvermont(scalar) 19 / 17 | 4 / 1 Silvermont 39 / 39 (6 uops) | 5 / 2 (No AVX) KNL(scalar) 27 / 17 (3 uops) | 6 / 0.5 KNL 32 / 20 (18uops) | 6 / 0.5 32 / 32 (18 uops) | 6 / 0.5 (AVX and AVX512)
double
latency (cycles) / throughput (cycles per instruction):
scalar & 128b vector 256b AVX vector divsd | mulsd divpd xmm | mulpd vdivpd ymm | vmulpd ymm Nehalem 7-22 / 7-22 | 5 / 1 (No AVX) Sandybridge 10-22 / 10-22 | 5 / 1 21-45 / 20-44 (3 uops) | 5 / 1 Haswell 10-20 / 8-14 | 5 / 0.5 19-35 / 16-28 (3 uops) | 5 / 0.5 Skylake 13-14 / 4 | 4 / 0.5 13-14 / 8 (1 uop) | 4 / 0.5 Piledriver 9-27 / 5-10 | 5-6 / 1 9-27 / 9-18 (2 uops) | 5-6 / 1 (2 uops) Ryzen 8-13 / 4-5 | 4 / 0.5 8-13 / 8-9 (2 uops) | 4 / 1 (2 uops) Low power CPUs: Jaguar 19 / 19 | 4 / 2 38 / 38 (2 uops) | 4 / 2 (2 uops) Silvermont(scalar) 34 / 32 | 5 / 2 Silvermont 69 / 69 (6 uops) | 5 / 2 (No AVX) KNL(scalar) 42 / 42 (3 uops) | 6 / 0.5 (Yes, Agner really lists scalar as slower than packed, but fewer uops) KNL 32 / 20 (18uops) | 6 / 0.5 32 / 32 (18 uops) | 6 / 0.5 (AVX and AVX512)
Ivybridge and Broadwell are different too, but I wanted to keep the table small. (Core2 (before Nehalem) has better divider performance, but its max clock speeds were lower.)
Atom, Silvermont, and even Knight's Landing (Xeon Phi based on Silvermont) have exceptionally low divide performance, and even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has low power per-core, and packing lots of cores on a die gives it tighter die-area constraints that Skylake-AVX512. It seems that AVX512ER rcp28ps
/ pd
is what you're "supposed" to use on KNL.
(See this InstLatx64 result for Skylake-AVX512 aka Skylake-X. Numbers for vdivps zmm
: 18c / 10c, so half the throughput of ymm
.)
Long latency chains become a problem when they're loop-carried, or when they're so long that they stop out-of-order execution from finding parallelism with other independent work.
Footnote 1: how I made up those div vs. mul performance ratios:
FP divide vs. multiple performance ratios are even worse than that in low-power CPUs like Silvermont and Jaguar, and even in Xeon Phi (KNL, where you should use AVX512ER).
Actual divide/multiply throughput ratios for scalar (non-vectorized) double
: 8 on Ryzen and Skylake with their beefed-up dividers, but 16-28 on Haswell (data-dependent, and more likely towards the 28 cycle end unless your divisors are round numbers). These modern CPUs have very powerful dividers, but their 2-per-clock multiply throughput blows it away. (Even more so when your code can auto-vectorize with 256b AVX vectors). Also note that with the right compiler options, those multiply throughputs also apply to FMA.
Numbers from http://agner.org/optimize/ instruction tables for Intel Haswell/Skylake and AMD Ryzen, for SSE scalar (not including x87 fmul
/ fdiv
) and for 256b AVX SIMD vectors of float
or double
. See also the x86 tag wiki.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With