Why are denormal floating-point values slower to handle?

It is generally the case that floating-point operations that consume or produce denormals are slower than those operating only on normalized values, sometimes much slower.

Why is this the case? If it is because they trap to software instead of being handled directly in hardware, as is reportedly the case on some CPUs, why do they have to do that?

asked Mar 01 '19 by rwallace

People also ask

What is a denormalized floating-point number?

Conversely, a denormalized floating-point value has a significand with a leading digit of zero. Of these, the subnormal numbers represent values which, if normalized, would have exponents below the smallest representable exponent (the exponent having a limited range).

What are denormalized values?

A number is denormalized if the exponent field contains all 0s and the fraction field does not contain all 0s. Thus denormalized single-precision numbers lie in the range ±2^-149 to ±(1 - 2^-23) × 2^-126 inclusive, and denormalized double-precision numbers lie in the range ±2^-1074 to ±(1 - 2^-52) × 2^-1022 inclusive.
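
To make these limits concrete, here is a minimal C sketch (assuming a C11 compiler and IEEE-754 floats; FLT_TRUE_MIN and DBL_TRUE_MIN are the C11 names for the smallest positive subnormals) that prints the boundary values and uses fpclassify() to detect a denormalized number:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        /* FLT_TRUE_MIN / DBL_TRUE_MIN (C11) are the smallest positive
           subnormals; FLT_MIN / DBL_MIN are the smallest normals. */
        printf("smallest subnormal float:  %g\n", (double)FLT_TRUE_MIN); /* ~1.4e-45 */
        printf("smallest normal float:     %g\n", (double)FLT_MIN);      /* ~1.2e-38 */
        printf("smallest subnormal double: %g\n", DBL_TRUE_MIN);         /* ~4.9e-324 */
        printf("smallest normal double:    %g\n", DBL_MIN);              /* ~2.2e-308 */

        /* The exponent field of FLT_MIN / 2 is all zeros, the fraction
           field is not: fpclassify() reports it as FP_SUBNORMAL. */
        float x = FLT_MIN / 2.0f;
        printf("FLT_MIN / 2 is subnormal:  %d\n", fpclassify(x) == FP_SUBNORMAL);
        return 0;
    }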

What are denormal floating point numbers?

Denormal floating point numbers are essentially roundoff errors in normalized numbers near the underflow limit, realmin, which is 2^(-emax+1). They are equally spaced, with a spacing of eps*realmin. Zero is naturally included as the smallest denormal.
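
The spacing claim is easy to check in C (a small sketch assuming IEEE-754 doubles; DBL_TRUE_MIN requires C11):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        /* eps * realmin: 2^-52 * 2^-1022 = 2^-1074, the subnormal spacing */
        double spacing = DBL_EPSILON * DBL_MIN;
        printf("eps * realmin  = %g\n", spacing);       /* 4.94066e-324 */
        printf("DBL_TRUE_MIN   = %g\n", DBL_TRUE_MIN);  /* the same value */

        /* Consecutive subnormals differ by exactly that spacing... */
        double d = 10.0 * DBL_TRUE_MIN;
        printf("step at d      = %g\n", nextafter(d, 1.0) - d);
        /* ...and zero sits one step below the smallest subnormal. */
        printf("below true min = %g\n", nextafter(DBL_TRUE_MIN, 0.0));
        return 0;
    }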

Why is the mantissa of a floating point number so small?

To increase accuracy near zero, floating point implementations let the number become "denormalized", so instead of the smallest number being 1.0 times 2 raised to the most negative exponent, the mantissa can become as small as binary 0.000…1 (24 digits in single precision). The penalty is that floating point math operations become considerably slower.
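
A short C sketch of this gradual loss of mantissa bits: starting at the smallest normal float, repeated halving keeps producing nonzero subnormal values, one fewer bit each time, until after 24 steps the last bit is gone:

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        float x = FLT_MIN;              /* smallest normal float, 2^-126 */
        for (int i = 0; i < 26 && x > 0.0f; i++) {
            printf("FLT_MIN / 2^%-2d = %g\n", i, x);
            x *= 0.5f;                  /* each halving drops one mantissa bit */
        }
        /* The loop exits once all 24 mantissa bits are exhausted (x == 0). */
        return 0;
    }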

What are the advantages of using denormal numbers in programming?

This allows what is known as “gradual underflow” when a result is very small, and helps avoid catastrophic division-by-zero errors. Denormal numbers can incur extra computational cost. The Wikipedia entry explains that some platforms implement denormal numbers in software, while others handle them in hardware.
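
A standard illustration of the division-by-zero point, as a hedged C sketch with hypothetical values: with gradual underflow, x != y guarantees x - y != 0, so a guarded division stays safe even when the difference is subnormal, whereas in a flush-to-zero mode the difference could become exactly zero and the division would blow up:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 4.0e-308;            /* both normal, but very close... */
        double y = 3.0e-308;
        double d = x - y;               /* ~1.0e-308: a subnormal result */

        if (x != y) {                   /* with gradual underflow, d != 0 here */
            printf("d   = %g (subnormal: %d)\n", d, fpclassify(d) == FP_SUBNORMAL);
            printf("1/d = %g\n", 1.0 / d);
        }
        return 0;
    }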

What are denormals and gradual underflow?

Denormal floating point numbers and gradual underflow are an underappreciated feature of the IEEE floating point standard. Double precision denormals are so tiny that they are rarely numerically significant, but single precision denormals can be in the range where they affect some otherwise unremarkable computations.


1 Answer

With IEEE-754 floating-point most operands encountered are normalized floating-point numbers, and internal data paths in processors are built for normalized operands. Additional exponent bits may be used for internal representations to keep floating-point operands normalized inside the data path at all times.

Any subnormal inputs therefore require additional work: first determining the number of leading zeros, then left-shifting the significand for normalization while adjusting the exponent. A subnormal result requires right-shifting the significand by the appropriate amount, and rounding may need to be deferred until after that has happened.
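
As a purely illustrative sketch (hypothetical, not any real FPU's data path or microcode), the normalization step for a subnormal single-precision input might look like this in C, using the GCC/Clang __builtin_clz intrinsic to count leading zeros:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Internal form: explicit leading bit and an exponent wider than
       the 8-bit storage field, so the data path only ever sees
       normalized significands. */
    typedef struct {
        uint32_t sign;
        int32_t  exp;   /* unbiased, extended range */
        uint32_t sig;   /* 24 bits, leading 1 explicit at bit 23 */
    } unpacked;

    static unpacked unpack_and_normalize(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);

        unpacked u = { bits >> 31, 0, 0 };
        uint32_t biased = (bits >> 23) & 0xFF;
        uint32_t frac   = bits & 0x7FFFFF;

        if (biased != 0) {                    /* normal input: fast path */
            u.exp = (int32_t)biased - 127;
            u.sig = frac | (1u << 23);        /* make the hidden bit explicit */
        } else if (frac != 0) {               /* subnormal input: extra work */
            int p = 31 - __builtin_clz(frac); /* find the leading 1 bit */
            u.exp = p - 149;                  /* adjust the exponent to match */
            u.sig = frac << (23 - p);         /* left-shift to normalize */
        }                                     /* (zero left as all-zeros) */
        return u;
    }

    int main(void)
    {
        unpacked u = unpack_and_normalize(1e-40f);  /* a subnormal input */
        printf("sign=%u exp=%d sig=0x%06X\n", u.sign, u.exp, u.sig);
        return 0;
    }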

If solved purely in hardware, this additional work typically requires additional hardware and additional pipeline stages: one, maybe even two, additional clock cycles each for handling subnormal inputs and subnormal outputs. But the performance of typical CPUs is sensitive to the latency of instructions, and significant effort is expended to keep latencies low. The latency of an FADD, FMUL, or FMA instruction is typically between 3 and 6 cycles, depending on implementation and frequency targets.

Adding, say, 50% additional latency for the potential handling of subnormal operands is therefore unattractive, even more so because subnormal operands are rare for most use cases. Using the design philosophy of "make the common case fast, and the uncommon case functional" there is therefore a significant incentive to push the handling of subnormal operands out of the "fast path" (pure hardware) into a "slow path" (combination of existing hardware plus software).

I have participated in the design of floating-point units for x86 processors, and the common approach for handling subnormals is to invoke an internal micro-code level exception when these need to be handled. This subnormal handling may take on the order of 100 clock cycles. The most expensive part of that is typically not the execution of the fix-up code itself, but getting in and out of the microcode exception handler.
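
The cost is easy to observe with a crude microbenchmark; here is a hedged C sketch (results vary widely by CPU, and compiling with flags that enable flush-to-zero, such as -ffast-math on GCC/Clang, typically erases the gap):

    #include <stdio.h>
    #include <time.h>
    #include <float.h>

    /* Time the same multiply-add loop with normal vs. subnormal inputs.
       On CPUs that take a microcode assist for subnormals, the second
       call can be an order of magnitude slower or worse. */
    static double time_loop(volatile float x, long iters)
    {
        clock_t t0 = clock();
        float acc = 0.0f;
        for (long i = 0; i < iters; i++)
            acc += x * 0.5f;            /* consumes (and may produce) subnormals */
        clock_t t1 = clock();
        if (acc == 12345.0f) puts("");  /* defeat dead-code elimination */
        return (double)(t1 - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        const long N = 100000000L;
        printf("normal operands:    %.3f s\n", time_loop(1.0f, N));
        printf("subnormal operands: %.3f s\n", time_loop(FLT_MIN / 16.0f, N));
        return 0;
    }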

I am aware of specific use cases, for example particular filters in digital signal processing, where encountering subnormals is common. To support such applications at speed, many floating-point units support a non-standard flush-to-zero mode in which subnormal encodings are treated as zero.
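
On x86 this mode is controlled through the MXCSR register, exposed via SSE intrinsics; a short x86-specific C sketch (the FTZ bit flushes subnormal results to zero, and the companion DAZ bit treats subnormal inputs as zero):

    #include <stdio.h>
    #include <float.h>
    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (SSE) */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (SSE3) */

    int main(void)
    {
        volatile float tiny = FLT_MIN;  /* volatile: block constant folding */

        printf("IEEE mode: FLT_MIN/2 = %g\n", tiny / 2.0f);  /* subnormal */

        /* Flush subnormal results to zero and treat subnormal inputs as
           zero: non-standard, but keeps everything on the fast path. */
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

        printf("FTZ+DAZ:   FLT_MIN/2 = %g\n", tiny / 2.0f);  /* 0 */
        return 0;
    }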

Note that there are throughput-oriented processor designs with significant latency tolerance, in particular GPUs. I am familiar with NVIDIA GPUs, and best I can tell they handle subnormal operands without additional overhead and have done so for the past dozen years or so. Presumably this comes at the cost of additional pipeline stages, but the vendor does not document many of the microarchitectural details of these processors, so it is hard to know for sure. The following paper may provide some general insights into how different hardware designs handle subnormal operands, some with very little overhead:

E. M. Schwarz, M. Schmookler, and S. D. Trong, "FPU implementations with denormalized numbers," IEEE Transactions on Computers, Vol. 54, No. 7, July 2005, pp. 825-836.

answered Nov 30 '22 by njuffa