What are the steps in the algorithm to do floating point division?
Why is the result slower than say, multiplication?
Is it done the same way we do division by hand? By repeatedly dividing by the divisor, subtracting the result to obtain a remainder, aligning the number again and continuing till the remainder is less than a particular value?
Also, why do we gain on performance if instead of doing
a = b / c
we do
d = 1 / c
a = b * d
?
Edit: Basically I was asking because someone asked me to distribute a value among contenders based on the assignment of weights. I did all this in integers and was later asked to convert to float, which caused a slowdown in performance. I was just interested in knowing how would C or C++ do these operations that would cause the slowness.
Often floating point multiply is faster than integer multiply (because floating point multiply is used more often so the CPU designers spend more effort optimising that path). Floating point divide may well be slow (often more than 10 cycles) , but then so is integer divide.
Multiplication is faster than division.
Division is slower than multiplication because there is no direct, parallel method for calculating it : Either there is an iteration, or hardware is copied to implement the iteration as cascaded (or pipelined) blocks.
It used to be that both multiplication and division were slow operations. Nowadays multiplication is a bit faster (but slightly slower than addition/subtraction), but division still is slower than the others.
FPU division often basically uses Newton-Raphson (or some other algorithm) to get a reciprocal then multiplies by that reciprocal. That's why the reciprocal operation is slightly faster than the general division operation.
This HP paper (which is actually more understandable than most papers I come across talking about Newton-Raphson) has this to say about floating point division:
Floating point division and square root take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the reciprocal of the denominator (division) or the reciprocal square root, and then multiply by the numerator (division) or input argument (square root).
From a hardware point of view division is a iterative algorithm, and the time it takes is proportional to the number of bits. The fastest division that is currently around uses the radix4 algorithm which generates 4 bit of result per iteration. For a 32 bit divide you need 8 steps at least.
Multiplication can be done in parallel to a certain degree. Without going into detail you can break up a large multiplication into several smaller, independent ones. These multiplications can again be broken down until you're at a bit-level, or you stop earlier and use a small lookup-table in hardware. This makes the multiplication hardware heavy from a silicon real estate point of view but very fast as well. It's the classic size/speed tradeoff.
You need log2 steps to combine the parallel computed results, so a 32 bit multiply need 5 logical steps (if you go down to the minimum). Fortunately these 5 steps are a good deal simpler than the division steps (it's just additions). That means in practice multiplies are even faster.
As described in the Wikipedia article Division algorithm, there are two main aproaches to division in computers:
Uses the following recurrence and finds one digit per iteration:
partialRemainder[j+1] = radix * partialRemainder[j] - quotientDigit[n-(j+1)]*denominator
Starts with an estimation and converges on the quotient. How accurate you are depends on the number of iterations.
Newton-Raphson division (very briefly):
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With