A decade or two ago, it was worthwhile to write numerical code to avoid using multiplies and divides and use addition and subtraction instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.
Is this still the case, or have modern computer architectures advanced to the point where *,/ are no longer many times slower than +,- ?
To be specific, I'm interested in compiled C/C++ code running on modern typical x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.
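For readers unfamiliar with the forward-difference technique mentioned in the question, here is a minimal sketch for a cubic polynomial sampled at evenly spaced points. The function names `horner3` and `forward_diff3` are made up for this illustration and are not from any library. Direct evaluation (Horner's rule) costs three multiplies and three adds per point, while the forward-difference loop costs three adds and no multiplies after setup.

```c
#include <assert.h>
#include <stddef.h>

/* Direct evaluation of c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3 with Horner's
   rule: three multiplies and three adds per point. */
double horner3(const double c[4], double x)
{
    return ((c[3] * x + c[2]) * x + c[1]) * x + c[0];
}

/* Fill out[k] = p(x0 + k*h) for k = 0..n-1 using forward differences.
   After the setup, each point costs three adds and no multiplies.
   (Hypothetical helper, written only for this illustration.) */
void forward_diff3(const double c[4], double x0, double h,
                   double *out, size_t n)
{
    assert(out != NULL || n == 0);

    /* Seed the difference table from four directly computed samples. */
    double y0 = horner3(c, x0);
    double y1 = horner3(c, x0 + h);
    double y2 = horner3(c, x0 + 2.0 * h);
    double y3 = horner3(c, x0 + 3.0 * h);

    double d1 = y1 - y0;                       /* first difference  */
    double d2 = (y2 - y1) - d1;                /* second difference */
    double d3 = ((y3 - y2) - (y2 - y1)) - d2;  /* third difference,
                                                  constant for a cubic */
    double y = y0;
    for (size_t k = 0; k < n; k++) {
        out[k] = y;
        y  += d1;   /* advance the value */
        d1 += d2;   /* advance the first difference */
        d2 += d3;   /* advance the second difference */
    }
}
```

Note that each output depends on the previous one, so rounding error accumulates along the run and the adds form a serial dependency chain. Both effects are part of why the trade-off is less clear-cut on current hardware.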
Bit-shift is fast and can provide a nice optimization if multipliers (or especially divisors!) are powers-of-two. As John Collins mentioned, floating-point MUL can be faster than floating-point ADD/SUB -- and perhaps as fast as integer MUL, depending on architecture.
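The shift trick applies to integers only, not to floating-point values. A minimal sketch with unsigned 32-bit integers follows; the helper names `mul_pow2` and `div_pow2` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 2^k with a left shift. Unsigned arithmetic wraps modulo
   2^32, which matches x * (1u << k). */
uint32_t mul_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);   /* shifting by the full width or more is undefined */
    return x << k;
}

/* Divide by 2^k with a right shift. This matches x / (1u << k) for
   unsigned values; for negative signed values the two can differ. */
uint32_t div_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x >> k;
}
```

Optimizing compilers generally perform this substitution themselves when the multiplier or divisor is a compile-time constant, so writing `x * 8` or `x / 4` is usually just as fast and easier to read.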
Multiplication is faster than division.
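One common way to exploit this is to divide once and multiply many times when the divisor does not change inside a loop. The sketch below assumes a hypothetical helper `scale_by_reciprocal`; it is an illustration, not a recommendation for every case, because the result can differ from a true division in the last bit.

```c
#include <assert.h>
#include <stddef.h>

/* Scale every element by 1/d: one division up front, then n multiplies.
   (Hypothetical helper, written only for this illustration.) */
void scale_by_reciprocal(double *a, size_t n, double d)
{
    assert(d != 0.0);
    const double inv = 1.0 / d;   /* the only division */
    for (size_t i = 0; i < n; i++)
        a[i] *= inv;              /* may differ from a[i] / d in the last bit */
}
```

Because the rewritten loop is not bit-for-bit identical to dividing each element, compilers typically will not make this change on their own unless relaxed floating-point options are enabled.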
First off, you can multiply floats. The problem you have is not the multiplication itself, but the original number you've used. Multiplication can lose some precision, but here the original number you've multiplied started with lost precision. This is actually an expected behavior.
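To see that the error is already present before any multiplication happens, one can measure what is lost just by storing a value in a float. The function `float_storage_error` below is a made-up name for this sketch, and the error is measured relative to the double value passed in, which may itself be a rounded version of the decimal literal.

```c
#include <assert.h>

/* Error introduced just by storing x in a float, before any arithmetic.
   (Hypothetical helper, written only for this illustration.) */
double float_storage_error(double x)
{
    assert(x == x);                    /* reject NaN */
    volatile float stored = (float)x;  /* rounds to the nearest float */
    return (double)stored - x;
}
```

A value such as 0.5 is a power of two and is stored exactly, so the function returns zero for it. A value such as 0.1 has no finite binary expansion, so the function returns a small nonzero error, and any later multiplication simply scales that error along with the value.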
Arithmetic operations on floating point numbers consist of addition, subtraction, multiplication and division. The operations are done with algorithms similar to those used on sign magnitude integers (because of the similarity of representation); for example, magnitudes are added only when the two numbers have the same sign.
It also depends on instruction mix. Your processor will have several computation units standing by at any time, and you'll get maximum throughput if all of them are filled all the time. So, executing a loop of muls is just as fast as executing a loop of adds, but the same doesn't hold if the expression becomes more complex.
For example, take this loop:
for(int j=0;j<NUMITER;j++) {
    for(int i=1;i<NUMEL;i++) {
        bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i] ;
    }
}
With NUMITER=10^7, NUMEL=10^2, and all arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with
bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;
it only takes 1.7 seconds... so since we "overdid" the additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:
bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;
-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently; so dot-product like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/
again takes 1.7 seconds.
bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/
(same as initial loop, but without expensive constant addition: 2.1 seconds)
bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/
(mostly muls, but one addition: 1.9 seconds)
So, basically, it's hard to say which is faster. If you wish to avoid bottlenecks, it's more important to have a sane mix of operations, avoid NaN and INF, and avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make all the difference.
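For anyone who wants to repeat this kind of test, here is a minimal timing sketch. It is not the harness used for the numbers above; the function names `mixed_kernel` and `time_kernel` are made up, the loop runs over all elements starting at index 0, and `clock()` measures CPU time at fairly coarse resolution, so the run needs to be long enough to give a meaningful reading.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* One of the loop bodies discussed above: two multiplies and three adds
   per element. Swap the expression to compare other mixes.
   (Hypothetical helper, written only for this illustration.) */
double mixed_kernel(const double *a1, const double *a2,
                    const double *a3, const double *a4,
                    size_t numel, size_t numiter, double someval)
{
    double bla = 0.0;
    for (size_t j = 0; j < numiter; j++)
        for (size_t i = 0; i < numel; i++)
            bla += someval + a1[i] * a2[i] + a3[i] * a4[i];
    return bla;
}

/* Time one run with clock() and return the elapsed CPU seconds. The sum
   is stored through *result so the caller can use it, which keeps the
   compiler from discarding the loop as dead code. */
double time_kernel(const double *a1, const double *a2,
                   const double *a3, const double *a4,
                   size_t numel, size_t numiter, double someval,
                   double *result)
{
    assert(result != NULL);
    clock_t t0 = clock();
    *result = mixed_kernel(a1, a2, a3, a4, numel, numiter, someval);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Print or otherwise use the returned sum, build with the same optimization flags you ship with, and repeat each measurement a few times, since a single run can be noisy.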
Some more cases:
bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i] ;// arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, floats 2.2 seconds
bla += someval * arr1[i]* arr2[i];// 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];// 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];// 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86
In theory the information is here:
Intel®64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C INSTRUCTION LATENCY AND THROUGHPUT
For every processor they list, the latency on FMUL is very close to that of FADD or FDIV. On some of the older processors, FDIV is 2-3 times slower than that, while on newer processors, it's the same as FMUL.
Caveats:
The document I linked actually says you can't rely on these numbers in real life, since the processor will do whatever it can to make things faster as long as the result is correct.
There's a good chance your compiler will decide to use one of the many newer instruction sets that have a floating-point multiply / divide available.
This is a complicated document meant to be read only by compiler writers, and I might have gotten it wrong. For example, I'm not clear why the FDIV latency number is completely missing for some of the CPUs.