I'm interested in how these are determined: how is it decided how many bits go to the mantissa and how many to the exponent? I guess this is a noob question, but I would like to know the answer.
IEEE 754 numbers are divided into two formats based on the above three components: single precision and double precision. Special values: IEEE has reserved some encodings that would otherwise be ambiguous. Zero is a special value, denoted with an exponent and mantissa of 0; −0 and +0 are distinct values, though they compare equal.
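The two zeros really do compare equal while having different encodings; a minimal Python check using the standard struct module:

```python
import struct

print(-0.0 == 0.0)                     # True: +0 and -0 compare equal
print(struct.pack('>f', 0.0).hex())    # '00000000' -- sign bit clear
print(struct.pack('>f', -0.0).hex())   # '80000000' -- sign bit set
```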
1. Single Precision: a format proposed by IEEE for the representation of floating-point numbers. It occupies 32 bits in computer memory.
2. Double Precision: also an IEEE format for the representation of floating-point numbers. It occupies 64 bits in computer memory.
The IEEE single-precision floating-point format is a binary format that occupies 4 bytes (32 bits) in computer memory: 1 sign bit, 8 exponent bits, and 23 mantissa bits. In IEEE 754-2008 the 32-bit base-2 format is officially referred to as binary32; it was called single in IEEE 754-1985.
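To make that layout concrete, here is a minimal Python sketch (the helper name f32_fields is my own) that unpacks a value's binary32 encoding into its sign, exponent, and mantissa fields:

```python
import struct

def f32_fields(x):
    """Return the (sign, biased exponent, mantissa) fields of x as binary32."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31             # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF        # 23 bits (implicit leading 1 not stored)
    return sign, exponent, mantissa

print(f32_fields(1.0))   # (0, 127, 0): actual exponent 0 + bias 127
```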
For example, for an actual exponent of −3: biased exponent = −3 + the "bias". In single precision the bias is 127, so in this example the biased exponent is 124; in double precision the bias is 1023, so the biased exponent in this example is 1020.
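You can verify those numbers directly: 0.125 is exactly 2⁻³, so its stored exponent field should come out as 124 in single precision and 1020 in double. A quick Python check:

```python
import struct

x = 0.125  # exactly 1.0 * 2**-3
b32 = struct.unpack('>I', struct.pack('>f', x))[0]
b64 = struct.unpack('>Q', struct.pack('>d', x))[0]
print((b32 >> 23) & 0xFF)    # 124  = -3 + 127
print((b64 >> 52) & 0x7FF)   # 1020 = -3 + 1023
```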
If you develop a format of your own, you can decide how many bits to give the exponent and how many the mantissa, depending on whether you need more precision or a larger range. Since IEEE-754 is designed for general use, its designers had to choose what works best in most situations.
Before IEEE-754 there were lots of floating-point formats with different pros and cons, some of them from DEC. DEC initially created the 32-bit F and 64-bit D formats for their VAX systems, both with 8 bits for the exponent, in order to represent all important physical constants, including the Planck constant (6.626070040 × 10⁻³⁴) and the Avogadro constant (6.022140857 × 10²³). But they quickly realized that this range was quite limited and overflow/underflow happened every now and then, so they added 3 more bits to the exponent to create the new 64-bit G format. When Dr. Kahan wrote the IEEE-754 draft he "suggested that DEC VAX's floating-point be copied because it was very good for its time", and that's why IEEE-754 single and double precision have 8 and 11 bits in their exponent parts respectively.
Another rationale for the 64-bit format's exponent range was to allow repeated multiplication without overflow:
For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format — a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer.
"A Proposed Standard for Binary Floating-Point Arithmetic", David Stephenson, IEEE Computer, Vol. 14, No. 3, March 1981, pp. 51-62
It's for the same reason that various DSPs have a wider accumulator register, usually 40 bits, to allow adding 32-bit values 256 times without overflow.
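The arithmetic behind that: 40 − 32 = 8 guard bits, and 2⁸ = 256, so even the worst-case sum of 256 full-scale 32-bit samples still fits. A one-liner to check:

```python
INT32_MAX = 2**31 - 1
worst_case = 256 * INT32_MAX   # sum of 256 full-scale 32-bit samples
print(worst_case < 2**39)      # True: fits a signed 40-bit accumulator
```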
In fact, nowadays the rule for IEEE-754 interchange formats is that a k-bit format gets round(4 × log2(k)) − 13 bits for the exponent, so every time we double the width of the type, the exponent gains ~4 more bits, which allows for 16 multiplications of the narrower type without overflow.
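A quick sketch of that rule; note that the standard specifies this formula for the wide formats (k ≥ 128), and it happens to reproduce binary64's 11 bits as well, while binary32 and binary16 have their exponent widths fixed explicitly:

```python
import math

def exponent_width(k):
    """Exponent width of a k-bit IEEE-754 interchange format (specified for k >= 128)."""
    return round(4 * math.log2(k)) - 13

for k in (64, 128, 256):
    print(k, exponent_width(k))   # 64 -> 11, 128 -> 15, 256 -> 19
```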
In the 16-bit half-float format, using only 4 bits for the exponent would make the range too narrow, with a maximum value much smaller than even the maximum 16-bit int value, so 5 bits are used instead. Half-floats are mainly used in computer graphics, where a precision of 11 bits is probably enough and the bigger exponent is needed for a wider dynamic range.
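The arithmetic behind that claim, assuming the usual IEEE conventions (bias 2^(w−1) − 1, top exponent value reserved for infinities/NaNs, 10 stored mantissa bits):

```python
# Hypothetical half-float with 4 exponent bits: bias 7, max unbiased exponent 7
max_4bit = (2 - 2**-10) * 2.0**7    # 255.875 -- far below 2**16 - 1 = 65535
# Actual binary16 with 5 exponent bits: bias 15, max unbiased exponent 15
max_f16 = (2 - 2**-10) * 2.0**15    # 65504.0
print(max_4bit, max_f16, 2**16 - 1)
```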
For more details read "Where did the free parameters of IEEE 754 come from?"