Which algorithms benefit most from fused multiply add?

fma(a,b,c) is equivalent to a*b+c, except that it doesn't round the intermediate result.

Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding?

It's not obvious, since the rounding after the multiplication (which FMA avoids) tends to be less problematic than the rounding after the addition (which it doesn't).
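For concreteness, here is a small C demonstration of the difference (the constants are chosen specifically so that the intermediate product a*b is not representable in double precision):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* a*b = (1 + 2^-27)(1 - 2^-27) = 1 - 2^-54, which is not
     * representable in double and rounds (ties-to-even) up to 1.0. */
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    printf("a*b + c    = %a\n", a * b + c);     /* 0x0p+0: product rounded to 1.0 */
    printf("fma(a,b,c) = %a\n", fma(a, b, c));  /* -0x1p-54: the exact answer */
    return 0;
}
```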

asked Aug 28 '10 by taw


2 Answers

taw hit on one important example; more generally, FMA allows library writers to efficiently implement many other floating-point operations with correct rounding.

For example, a platform that has an FMA can use it to implement correctly rounded divide and square root (PPC and Itanium took this approach), which lets the FPU be basically a single-purpose FMA machine. Peter Tang and John Harrison (Intel), and Peter Markstein (HP) have some papers that explain this use if you're curious.
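As a rough sketch of how that works (illustrative only; a real implementation starts from a hardware reciprocal-estimate instruction and iterates until the result is correctly rounded):

```c
#include <math.h>

/* Sketch of FMA-based division refinement in the PPC/Itanium style.
 * The plain 1.0/b here stands in for a cheap reciprocal-estimate
 * instruction; this is not a correctly-rounded implementation. */
double div_via_fma(double a, double b) {
    double y = 1.0 / b;        /* reciprocal estimate y ~ 1/b */
    double q = a * y;          /* initial quotient estimate */
    double r = fma(-q, b, a);  /* residual a - q*b, with only one rounding */
    return fma(r, y, q);       /* refined quotient q + r*y */
}
```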

The example taw gave is more broadly useful than just in tracking error bounds. It allows you to represent the product of two floating point numbers as a sum of two floating point numbers without any rounding error; this is quite useful in implementing correctly-rounded floating-point library functions. Jean-Michel Muller's book or the papers on crlibm would be good starting places to learn more about these uses.
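In code, the exact-product representation looks something like this (the standard 2Prod-with-FMA idiom; the names are mine):

```c
#include <math.h>

/* Splits a*b into hi + lo with no rounding error (barring
 * overflow/underflow): hi is the rounded product, lo is the
 * exact amount the rounding discarded. */
void two_prod(double a, double b, double *hi, double *lo) {
    *hi = a * b;
    *lo = fma(a, b, -*hi);  /* exact: a*b - round(a*b) */
}
```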

FMA is also broadly useful in argument reduction in math-library-style routines. There, the goal of the computation is often a term of the form (x - a*b), where (a*b) is very nearly equal to x itself; without an FMA, the result is often on the order of the rounding error in the (a*b) term. I believe Muller has also written about this in his book.
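A sketch of the pattern (this is just the first step of a Cody-Waite-style reduction; real libraries split the constant into several parts and handle large arguments differently):

```c
#include <math.h>

/* Reduce x by the nearest multiple of C ~ pi/2. The fma computes
 * x - k*C with a single rounding, preserving the low-order bits
 * that a separate multiply-then-subtract would destroy when k*C
 * nearly cancels x. */
double reduce(double x) {
    const double C = 1.5707963267948966;  /* double closest to pi/2 */
    double k = nearbyint(x / C);
    return fma(-k, C, x);
}
```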

answered Sep 23 '22 by Stephen Canon


The only thing I have found so far is "error-free transformations". For any floating-point numbers, the errors of a+b, a-b, and a*b are themselves exactly representable as floating-point numbers (in round-to-nearest mode, assuming no overflow/underflow, etc.).

The addition (and obviously subtraction) error is easy to compute: if abs(a) >= abs(b), the error is exactly b-((a+b)-a) (2 flops, or 4-5 if we don't know which operand is bigger). The multiplication error is trivial to compute with fma: it is simply fma(a,b,-a*b). Without fma it takes 16 flops of rather nasty code, and fully generic emulation of correctly rounded fma is even slower than that.
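Written out (these are the standard Fast2Sum/TwoSum idioms, plus the fma trick for the product):

```c
#include <math.h>

/* Fast2Sum: assumes |a| >= |b|. Returns s = round(a+b) and sets
 * *err so that a + b == s + *err exactly. 2 extra flops. */
double fast_two_sum(double a, double b, double *err) {
    double s = a + b;
    *err = b - (s - a);
    return s;
}

/* Knuth's TwoSum: no magnitude assumption, 5 extra flops. */
double two_sum(double a, double b, double *err) {
    double s = a + b;
    double bb = s - a;
    *err = (a - (s - bb)) + (b - bb);
    return s;
}

/* Product error with fma: one extra flop. */
double two_prod_err(double a, double b, double p) {  /* p = a*b, rounded */
    return fma(a, b, -p);
}
```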

An extra 16 flops of error tracking per flop of real computation would be huge overkill, but with just 1-5 pipeline-friendly flops it's quite reasonable. For many algorithms built on these transformations, the 50%-200% overhead of error tracking and compensation gives a result as accurate as if all the calculations had been done in twice the number of bits, avoiding ill-conditioning in many cases.

Interestingly, fma is never used in these algorithms to compute results, just to find errors, because finding the error of an fma is as slow as finding the error of a multiplication was without fma.

Relevant keywords to search for are "compensated Horner scheme" and "compensated dot product", with the Horner scheme benefiting a lot more.
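For example, a compensated Horner evaluation might look like this (a sketch in the style of the Graillat/Langlois/Louvet CompHorner algorithm; the names are mine):

```c
#include <math.h>

/* Rounding error of a product, via fma. */
static double prod_err(double a, double b, double p) {
    return fma(a, b, -p);
}

/* Rounding error of a sum (Knuth's TwoSum, no magnitude assumption). */
static double sum_err(double a, double b, double s) {
    double bb = s - a;
    return (a - (s - bb)) + (b - bb);
}

/* Compensated Horner evaluation of c[0] + c[1]*x + ... + c[n]*x^n.
 * Runs ordinary Horner, captures the rounding error of every multiply
 * and add, evaluates that error polynomial alongside with plain Horner,
 * and adds it back at the end. The result is about as accurate as if
 * the whole evaluation had been done in twice the precision. Note that
 * fma appears only inside prod_err, i.e. only to find errors. */
double horner_comp(const double *c, int n, double x) {
    double s = c[n];
    double e = 0.0;  /* running evaluation of the error polynomial */
    for (int i = n - 1; i >= 0; i--) {
        double p  = s * x;
        double pi = prod_err(s, x, p);
        s = p + c[i];
        double sg = sum_err(p, c[i], s);
        e = e * x + (pi + sg);
    }
    return s + e;
}
```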

answered Sep 21 '22 by taw