What's the relative speed of floating point add vs. floating point multiply

Tags: floating-point, x86, flops, mips, numerical-computing

A decade or two ago, it was worthwhile to write numerical code to avoid using multiplies and divides and use addition and subtraction instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.
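As a rough illustration of the forward-difference technique mentioned above, here is a minimal sketch for a cubic. The function names `horner` and `forward_diff` and the coefficient layout `a[0..3]` are made up for this example. After a one-time setup, each new sample costs three additions and no multiplies.

```cpp
#include <cstddef>
#include <vector>

// Direct evaluation of p(x) = a[0] + a[1]*x + a[2]*x^2 + a[3]*x^3
// using Horner's rule: three multiplies and three adds per point.
double horner(const double a[4], double x) {
    return ((a[3] * x + a[2]) * x + a[1]) * x + a[0];
}

// Forward differencing: tabulate the same cubic at x0, x0+h, x0+2h, ...
// The setup uses four direct evaluations; the loop uses additions only.
std::vector<double> forward_diff(const double a[4], double x0, double h,
                                 std::size_t n) {
    const double p0 = horner(a, x0);
    const double p1 = horner(a, x0 + h);
    const double p2 = horner(a, x0 + 2 * h);
    const double p3 = horner(a, x0 + 3 * h);

    double y  = p0;                          // current value
    double d1 = p1 - p0;                     // first difference
    double d2 = p2 - 2 * p1 + p0;            // second difference
    double d3 = p3 - 3 * p2 + 3 * p1 - p0;   // third difference, constant for a cubic

    std::vector<double> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(y);
        y  += d1;
        d1 += d2;
        d2 += d3;
    }
    return out;
}
```

With `a = {1, 2, 3, 4}`, `x0 = 0` and `h = 1`, the first six values are 1, 10, 49, 142, 313, 586, matching direct evaluation. For step sizes that are not exactly representable, rounding error accumulates from one sample to the next, which is the usual price of this method.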

Is this still the case, or have modern computer architectures advanced to the point where *,/ are no longer many times slower than +,- ?

To be specific, I'm interested in compiled C/C++ code running on modern typical x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.

771

asked Jul 18 '09 01:07

J. Peterson

2 Answers

It also depends on the instruction mix. Your processor has several computation units standing by at any time, and you'll get maximum throughput if all of them are kept busy all the time. So executing a loop of muls is just as fast as executing a loop of adds, but the same doesn't hold if the expression becomes more complex.

For example, take this loop:

for (int j = 0; j < NUMITER; j++) {
  for (int i = 1; i < NUMEL; i++) {
    bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i];
  }
}

For NUMITER=10^7 and NUMEL=10^2, with all four arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

it only takes 1.7 seconds. Since we "overdid" the additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently; so dot-product like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as initial loop, but without expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but one addition: 1.9 seconds)

So, basically, it's hard to say which is faster. If you wish to avoid bottlenecks, it's more important to have a sane mix of operations, avoid NaN or INF, and avoid adding constants. Whatever you do, make sure you test, and test various compiler settings, since small changes can often make all the difference.

Some more cases:

bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i]; // arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, floats, 2.2 seconds
bla += someval * arr1[i] * arr2[i]; // 0.9 in x64, 1.6 in x86
bla += someval * arr1[i]; // 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i]; // 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86

174

answered Sep 23 '22 08:09

Eamon Nerbonne

In theory the information is here:

Intel®64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C INSTRUCTION LATENCY AND THROUGHPUT

For every processor they list, the latency of FMUL is very close to that of FADD or FDIV. On some of the older processors, FDIV is 2-3 times slower than that, while on newer processors it's the same as FMUL.

Caveats:

The document I linked actually says you can't rely on these numbers in real life, since the processor will do whatever it can to make things faster as long as the result is correct.
There's a good chance your compiler will use one of the many newer instruction sets (SSE2 or AVX, for example) that have their own floating-point multiply and divide instructions, so the FMUL/FDIV numbers may not be the relevant ones.
This is a complicated document meant to be read by compiler writers, and I might have gotten it wrong. For example, I'm not clear why the FDIV latency number is completely missing for some of the CPUs.
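One reason a latency figure alone says little about real loops is that latency only limits code where each operation waits on the previous result. The sketch below contrasts a single dependent chain with four independent chains, which give the processor more work to overlap. The function names are hypothetical, and whether the second version is actually faster on a given CPU and compiler has to be measured. The same pattern applies to multiplies.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each add depends on the result of the previous add,
// so the loop is limited by the latency of the add instruction.
double sum_one_chain(const std::vector<double>& v) {
    double acc = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        acc += v[i];
    }
    return acc;
}

// Four accumulators: the four adds in each iteration are independent of
// one another, so they can overlap in the pipeline.
double sum_four_chains(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < v.size(); ++i) {  // leftover elements
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

The two functions add the elements in a different order, so for general inputs their results can differ in the last bits. That is also why compilers normally won't make this transformation themselves unless given a flag such as -ffast-math.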

answered Sep 23 '22 08:09

Scott McIntyre
