
Where's the 24th fraction bit on a single precision float? IEEE 754

Tags: c++, c, ieee-754

I found myself today doing some bit manipulation and I decided to refresh my floating-point knowledge a little!

Things were going great until I saw this:

... 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits

I read it again and again, but I still can't figure out where the 24th bit is. I noticed something about a binary point, so I assumed it's a point in the middle, between the mantissa and the exponent.

I'm not really sure, but I believe the author was talking about this bit:

         Binary point?
             |
s------e-----|-------------m----------
0 - 01111100 - 01000000000000000000000
           ^ this
gifnoc-gkp asked Aug 14 '13 16:08



2 Answers

The 24th bit is implicit due to normalization.

The significand is shifted left (and one subtracted from the exponent for each bit shift) until the leading bit of the significand is a 1.

Then, since the leading bit is a 1, only the other 23 bits are actually stored.

There is also the possibility of a denormal number. The exponent is stored in a "bias" format: it's an unsigned number where the middle of the range is defined to mean 0. So, with 8 bits, it's stored as a number from 0..255 with a bias of 127: a stored 127 means an actual exponent of 0, a stored 1 means -126, and a stored 254 means 127 (the stored values 0 and 255 are reserved for denormals/zero and for infinities/NaNs).

If, in the process of normalization, the stored exponent is decremented to 0 (which corresponds to an effective exponent of -126), then normalization stops, and the significand is stored as-is. In this case, the implicit bit from normalization is taken to be a 0 instead of a 1.

Most floating point hardware is designed to basically assume numbers will be normalized, so they assume that implicit bit is a 1. During the computation, they check for the possibility of a denormal number, and in that case they do roughly the equivalent of throwing an exception, and re-start the calculation with that taken into account. This is why computation with denormals often gets drastically slower than otherwise.


  1. In case you wonder why it uses this strange format: IEEE floating point (like many others) is designed so that if you treat its bit pattern as an integer of the same size, values of the same sign still sort into the correct order as floating point numbers (negative values are sign-magnitude rather than 2's complement, so their order comes out reversed unless you transform them first). Since the sign of the number is in the most significant bit (where it is for a 2's complement integer), that's treated as the sign bit. The bits of the exponent are stored as the next most significant bits -- but if we used 2's complement for them, an exponent less than 0 would set the second most significant bit of the number, which would make it look like a big number as an integer. By using bias format, a smaller exponent leaves that bit clear and a larger exponent sets it, so the order as an integer reflects the order as a floating point number.
Jerry Coffin answered Nov 05 '22 06:11


Normally (pardon the pun), the leading bit of a floating point number is always 1; thus, it doesn't need to be stored anywhere. The reason is that, if it weren't 1, that would mean you had chosen the wrong exponent to represent it; you could get more precision by shifting the mantissa bits left and using a smaller exponent.

The one exception is denormal/subnormal numbers, which are represented by all zero bits in the exponent field (the lowest possible exponent). In this case, there is no implicit leading 1 in the mantissa, and you have diminishing precision as the value approaches zero.

R.. GitHub STOP HELPING ICE answered Nov 05 '22 08:11