How are IEEE-754 single and double precision formats determined?

1 Answer

If you design a format of your own, you can decide how many bits go to the exponent and how many to the mantissa, depending on whether you need more precision or a larger range. Since IEEE-754 was designed for general use, its designers had to choose what works best in most situations.

Before IEEE-754 there were many floating-point formats with different pros and cons, some of them from DEC. Initially DEC created the 32-bit F and 64-bit D formats for its VAX systems. Both have 8 bits for the exponent, enough to represent all important physical constants, including the Planck constant (6.626070040 × 10⁻³⁴) and the Avogadro constant (6.022140857 × 10²³). But DEC quickly realized that this range was quite limited and that overflow and underflow happened every now and then, so it added 3 more bits to the exponent to create a new 64-bit G format. When Dr. Kahan wrote the IEEE-754 draft he "suggested that DEC VAX's floating-point be copied because it was very good for its time", and that's why IEEE-754 single and double precision have 8 and 11 bits in the exponent part respectively.
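As a quick sanity check of the range argument, here is a minimal sketch that round-trips both constants through a 32-bit format with 8 exponent bits, using Python's `struct` module. It uses IEEE-754 binary32 as a stand-in; the VAX F format has a slightly different bias and range, so this only illustrates the idea. The helper name `to_float32` is made up for this example.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

PLANCK = 6.626070040e-34    # value quoted in the answer
AVOGADRO = 6.022140857e23   # value quoted in the answer

for name, value in (("Planck", PLANCK), ("Avogadro", AVOGADRO)):
    rounded = to_float32(value)
    rel_err = abs(rounded - value) / value
    # Neither constant collapses to 0.0 or overflows; only the precision drops.
    print(f"{name}: {value!r} -> {rounded!r} (relative error {rel_err:.1e})")
```

Both values survive as normal (non-zero, finite) numbers, with a relative error bounded by the 24-bit precision of the format.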

Another rationale for the exponent range of the 64-bit format is to allow repeated multiplication of 32-bit values without overflow:

For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format — a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer.

"A Proposed Standard for Binary Floating-Point Arithmetic", David Stephenson, IEEE Computer, Vol. 14, No. 3, March 1981, pp. 51-62

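The "product of eight 32-bit terms" claim from the quote can be checked numerically. The sketch below assumes Python's `float` is IEEE-754 binary64 (true on mainstream platforms) and multiplies copies of the largest finite binary32 value; `product_of_copies` is a helper name invented for this example.

```python
import math

# Largest finite binary32 value; exactly representable in binary64.
FLT_MAX = (2 - 2**-23) * 2.0**127

def product_of_copies(x: float, n: int) -> float:
    """Multiply n copies of x together in binary64 arithmetic."""
    result = 1.0
    for _ in range(n):
        result *= x   # overflow yields inf rather than raising
    return result

for n in (2, 8, 9):
    p = product_of_copies(FLT_MAX, n)
    status = "overflows" if math.isinf(p) else "finite"
    print(f"product of {n} copies of FLT_MAX: {status}")
```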
It's the same reason that various DSPs have a wider accumulator register, usually 40 bits wide, which allows adding 32-bit values 256 times without overflow.
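The 256 figure follows from the 8 extra guard bits (2⁸ = 256). A minimal sketch of the arithmetic, assuming a signed two's-complement 40-bit accumulator and signed 32-bit inputs (the helper `fits` is hypothetical):

```python
ACC_BITS = 40                      # accumulator width assumed in this sketch
ACC_MAX = 2**(ACC_BITS - 1) - 1    # largest signed 40-bit value
ACC_MIN = -(2**(ACC_BITS - 1))     # smallest signed 40-bit value
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def fits(total: int) -> bool:
    """True if total is representable in the signed 40-bit accumulator."""
    return ACC_MIN <= total <= ACC_MAX

for count in (256, 257):
    # Worst cases: every addend is the largest or the smallest 32-bit value.
    ok = fits(count * INT32_MAX) and fits(count * INT32_MIN)
    print(f"{count} worst-case additions fit: {ok}")
```

With 256 worst-case addends the sum still fits; the 257th can overflow.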

In fact, the current rule for IEEE-754 interchange formats of width k bits (k ≥ 128) sets the exponent size to round(4 log₂(k)) − 13 bits, which also happens to give 11 bits for k = 64. So every time the width of the type doubles, the exponent gains 4 more bits, which allows a product of 16 values of the narrower type without overflow.
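To see where the rule matches the actual formats, here is a small sketch that evaluates it for the standard binary widths. The function name `exponent_bits` and the `ACTUAL` table are written for this example; the widths in the table are those of binary16 through binary256.

```python
import math

def exponent_bits(k: int) -> int:
    """Exponent field width from the rule round(4 * log2(k)) - 13."""
    return round(4 * math.log2(k)) - 13

# Exponent widths of the actual IEEE-754 binary formats, for comparison.
ACTUAL = {16: 5, 32: 8, 64: 11, 128: 15, 256: 19}

for k, actual in ACTUAL.items():
    print(f"binary{k}: rule gives {exponent_bits(k)}, format uses {actual}")
```

The rule reproduces the 64-, 128- and 256-bit formats but not the 16- and 32-bit ones, whose exponent widths were chosen separately.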

The 16-bit half-float format uses 5 exponent bits instead of 4: with only 4 bits the range would be too narrow, and the maximum value would be far smaller than even the maximum 16-bit integer value. Half-floats are mainly used in computer graphics, where 11 bits of precision are probably enough and a bigger exponent is needed for a wider dynamic range.
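The difference between 4 and 5 exponent bits is easy to quantify. The sketch below computes the largest finite value of a 16-bit format, assuming IEEE-style conventions (one sign bit, an implicit leading 1, and the all-ones exponent reserved for infinities and NaNs); `max_finite` is a hypothetical helper.

```python
def max_finite(total_bits: int, exponent_bits: int) -> float:
    """Largest finite value of an IEEE-style binary format."""
    # Precision = stored fraction bits + implicit leading 1 (sign bit excluded).
    precision = total_bits - exponent_bits
    emax = 2**(exponent_bits - 1) - 1
    return (2 - 2.0**(1 - precision)) * 2.0**emax

INT16_MAX = 2**15 - 1

print("5 exponent bits:", max_finite(16, 5))   # 65504.0, the real binary16 maximum
print("4 exponent bits:", max_finite(16, 4))   # 255.9375
print("max 16-bit int: ", INT16_MAX)
```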

For more details, read "Where did the free parameters of IEEE 754 come from?"

answered Oct 28 '22 05:10

phuclv



Tags:

precision

ieee-754

guber90
