 

Why are denormalized floats so much slower than other floats, from a hardware architecture viewpoint?

Denormals are known to underperform severely, 100x or so, compared to normals. This frequently causes unexpected software problems.
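To make the "unexpected" part concrete, here is a minimal C sketch (an illustration added for clarity, not part of the original question) of how denormals typically sneak in: a value that decays exponentially toward zero, such as a filter or feedback tail, eventually lands in the subnormal range, and the arithmetic suddenly slows down.

```c
/* Illustration: an exponentially decaying value drifts into the subnormal
 * range long before it reaches zero, so a previously fast loop silently
 * starts doing denormal arithmetic. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 1.0;
    for (long i = 1; i <= 1100; i++) {
        x *= 0.5;                          /* exponential decay toward zero */
        if (fpclassify(x) == FP_SUBNORMAL) {
            printf("x became subnormal after %ld halvings: %g\n", i, x);
            break;
        }
    }
    return 0;
}
```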

I'm curious, from a CPU architecture viewpoint, why denormals have to be that much slower. Is the lack of performance intrinsic to their unfortunate representation? Or do CPU architects neglect them to reduce hardware cost, under the (mistaken) assumption that denormals don't matter?

In the former case, if denormals are intrinsically hardware-unfriendly, are there known non-IEEE-754 floating point representations that are also gapless near zero, but more convenient for hardware implementation?

Michael asked Apr 21 '16 22:04

People also ask

What is a denormalized float?

A normalized floating-point value has a significand whose leading digit is nonzero; conversely, a denormalized floating-point value has a significand with a leading digit of zero. These subnormal numbers represent values which, if normalized, would have exponents below the smallest representable exponent (the exponent having a limited range).

What are Denormalized values?

Denormalized values fill the range between the smallest representable normal value (exponent > 0) and zero itself. They provide gradual underflow: values too small to be normalized lose precision gradually instead of being flushed abruptly to zero. A complication when converting between floating-point precisions lies in the exponent conversion.
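For reference, a small C sketch (added for illustration) that prints the boundary values involved: the smallest positive normal double and the smallest positive subnormal, showing that subnormals fill the gap between DBL_MIN and zero.

```c
/* Illustration: subnormals fill the gap between the smallest normal double
 * (DBL_MIN = 2^-1022) and zero; the smallest subnormal is 2^-1074. */
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    printf("smallest normal double:    %g\n", DBL_MIN);
    printf("smallest subnormal double: %g\n", nextafter(0.0, 1.0));
    printf("DBL_MIN / 2 (subnormal, not flushed): %g\n", DBL_MIN / 2.0);
    return 0;
}
```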

What is flush to zero?

In flush-to-zero mode, denormalized inputs are treated as zero, and results that are too small to be represented as a normalized number are replaced with zero.
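On x86 with SSE, flush-to-zero (FTZ) and the related denormals-are-zero (DAZ) mode can be enabled through the MXCSR control register. A minimal sketch, assuming an x86 target and a compiler that ships xmmintrin.h/pmmintrin.h:

```c
/* Sketch: enable FTZ (subnormal results become zero) and DAZ (subnormal
 * inputs are treated as zero) in MXCSR, trading strict IEEE-754 behavior
 * for speed on code that would otherwise take FP assists. */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON */

static void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```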


1 Answer

On most x86 systems, the cause of slowness is that denormal values trigger an FP assist, which is very costly because it switches to a microcode flow (very much like a fault).

See for example: https://software.intel.com/en-us/forums/intel-performance-bottleneck-analyzer/topic/487262

The reason is probably that the architects decided to optimize the HW for normal values by speculating that each value is normalized (which is far more common), and did not want to risk the performance of the frequent use case for the sake of rare corner cases. This speculation is usually correct, so you only pay the penalty when it is wrong. These trade-offs are very common in CPU design, since any investment in one case usually adds overhead to the entire system.

In this case, if you were to design a system that tries to optimize all types of irregular FP values, you would have to either add HW to detect and record the state of each value after each operation (which would be multiplied by the number of physical FP registers, execution units, RS entries and so on, adding up to a significant number of transistors and wires), or add some mechanism to check the value on read, which would slow you down when reading any FP value (even the normal ones).

Furthermore, depending on the type, you would or would not need to perform some correction. On x86 this is the purpose of the assist code, but if you did not speculate, you would have to run this flow conditionally for each value, which would already add a large chunk of that overhead to the common path.
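A rough way to see the assist penalty yourself is a micro-benchmark like the sketch below (illustrative, not from the original answer): the same loop is timed once with normal operands and once with operands that stay in the subnormal range. On cores that take an assist per denormal result, the second run is dramatically slower unless FTZ/DAZ is enabled.

```c
/* Illustration: time the same multiply-add loop on normal vs. subnormal
 * values; the subnormal run typically shows the FP-assist penalty. */
#include <stdio.h>
#include <time.h>

static double spin(double seed, long iters) {
    volatile double acc = seed;          /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < iters; i++)
        acc = acc * 0.5 + seed;          /* result stays near the seed's magnitude */
    return acc;
}

int main(void) {
    const long N = 50 * 1000 * 1000;

    clock_t t0 = clock();
    spin(1.0, N);                        /* normal inputs and results */
    clock_t t1 = clock();
    spin(1e-310, N);                     /* 1e-310 is already subnormal for a double */
    clock_t t2 = clock();

    printf("normal:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("subnormal: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```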

Leeor answered Oct 26 '22 04:10