How are floating point numbers stored inside the CPU?

I am a beginner going through assembly basics. While reading the material, I came to this paragraph, which explains how floating point numbers are stored in memory.

The exponent for a float is an 8 bit field. To allow large numbers or small numbers to be stored, the exponent is interpreted as positive or negative. The actual exponent is the value of the 8 bit field minus 127. 127 is the "exponent bias" for 32 bit floating point numbers. The fraction field of a float holds a small surprise. Since 0.0 is defined as all bits set to 0, there is no need to worry about representing 0.0 as an exponent field equal to 127 and fraction field set to all 0's. All other numbers have at least one 1 bit, so the IEEE 754 format uses an implicit 1 bit to save space. So if the fraction field is 00000000000000000000000, it is interpreted as 1.00000000000000000000000. This allows the fraction field to be effectively 24 bits. This is a clever trick made possible by making exponent fields of 0x00 and 0xFF special.

I am not getting it at all.

Can you explain to me how they are stored in memory? I don't need references, I just need a good explanation that I can easily understand.

Yatendra Rathore asked Jan 31 '23 00:01



2 Answers

Floating point numbers follow the IEEE 754 standard. One reason for this particular layout is that floating point bit patterns can be compared (relatively) easily with integer operations, as the bias step below explains.

There are two common floating point formats: 32-bit (IEEE binary32, aka single-precision float) and 64-bit (binary64, aka double precision). The only difference between them is the size of their fields:

  • exponent: 8 bits for 32bit, 11 bits for 64bit
  • mantissa: 23 bits for 32bit, 52 bits for 64bit

There's an additional bit, the sign bit, that specifies if the considered number is positive or negative.
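To see these three fields in an actual value, here is a minimal Python sketch. Python and its standard `struct` module are my choice for illustration (the question is about assembly), and the helper name `float_fields` is made up. It rereads the 4 bytes of a 32-bit float as an unsigned integer and masks out each field:

```python
import struct

def float_fields(x: float) -> tuple:
    """Split a value, stored as binary32, into (sign, exponent field, fraction field)."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    exponent_field = (bits >> 23) & 0xFF   # 8 bits, still biased by 127
    fraction = bits & 0x7FFFFF             # 23 bits, the leading 1 is not stored
    return sign, exponent_field, fraction

for value in (1.0, -2.0, 0.15625):
    sign, exponent_field, fraction = float_fields(value)
    print(f"{value:>8}: sign={sign} exponent={exponent_field:08b} fraction={fraction:023b}")
```

For 1.0 the exponent field comes out as 01111111 (127, meaning an actual exponent of 0) and the fraction field is all zeros, because the leading 1 is implied.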

Now take, for example, 12.375 in base 10 (as a 32-bit float):

  • First step is to convert this number to base 2. The integer part 12 is 1100 and the fractional part 0.375 = 1/4 + 1/8 is .011, so you will have: 1100.011

  • Next you have to move the point until you get 1.100011 (until the only digit before the . is a 1). How many places did we move it? 3, and that is the exponent. It means that our number can be represented as 1.100011*2^3. (It's not called a decimal point because this is binary. It's a "radix point" or "binary point".)

    Moving the . around (and counting those moves with the exponent) until the mantissa starts with a leading 1. is called "normalizing". A number that's too small to be represented that way (limited range of the exponent) is called a subnormal or denormal number.

  • After that we have to add the bias to the exponent. That's 127 for the 8-bit exponent field in 32-bit floats. Why should we do this? Because this way floating point bit patterns can be compared using integer operations. (Comparing FP bit-patterns as integers tells you which one has the larger magnitude, if they have the same sign.) Also, incrementing the bit-pattern (including carry from the mantissa into the exponent) increases the magnitude to the next representable value. (nextafter())

    If we didn't do this, a negative exponent would be represented in two's complement notation, which puts a 1 in the most significant bit. Compared as integers, a number with a negative exponent would then look greater than one with a positive exponent. So we just add 127. With this little "trick", positive exponents start at 10000000 base 2 (the encoding of exponent 1), negative exponents reach at most 01111110 base 2 (the encoding of exponent -1), and exponent 0 is encoded as 01111111.

In our example the biased exponent is 10000010 base 2 (3 + 127 = 130).

  • Last thing to do is append the mantissa (.100011, padded with zeros to 23 bits) after the exponent. The result is:

    01000001010001100000000000000000
     |  exp ||      mantissa       |
    

(first bit is the sign bit)
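The steps above can be retraced in a few lines of Python. This is only a sketch under stated assumptions: the function name `encode_binary32` is hypothetical, and it handles only nonzero normal values that fit exactly in 23 fraction bits (no zero, subnormals, infinities, NaN, or rounding). The result is checked against what the standard `struct` module produces for the same value.

```python
import math
import struct

BIAS = 127          # exponent bias for binary32
FRACTION_BITS = 23  # stored mantissa bits (the leading 1 is implicit)

def encode_binary32(x: float) -> int:
    """Hand-assemble the binary32 bit pattern of a nonzero, normal, exactly representable value."""
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))     # abs(x) == m * 2**e with 0.5 <= m < 1
    significand = m * 2           # renormalize to the 1.xxx form used above
    exponent = e - 1              # which lowers the exponent by one
    # Drop the implicit leading 1 and scale the rest to a 23-bit integer.
    fraction = int((significand - 1) * (1 << FRACTION_BITS))
    return (sign << 31) | ((exponent + BIAS) << FRACTION_BITS) | fraction

bits = encode_binary32(12.375)
print(f"{bits:032b}")  # 01000001010001100000000000000000

# Cross-check against the bytes Python itself stores for a 32-bit float.
(reference,) = struct.unpack(">I", struct.pack(">f", 12.375))
print(bits == reference)  # True
```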

There's a nice online converter that visualizes the bits of a 32-bit float, and shows the decimal number it represents. You can modify either and it updates the other. https://www.h-schmidt.net/FloatConverter/IEEE754.html


That was the simple version which is a good start. It simplified by leaving out:

  • Not-A-Number NaN (biased exponent = all-ones; mantissa != 0)
  • +-Infinity (biased exponent = all-ones; mantissa = 0)
  • and didn't say much about subnormal numbers (biased exponent = 0 implies a leading 0 in the mantissa instead of the normal 1).
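The rules in this list can be written down as a small classifier and decoder. This is an illustrative sketch, not a complete IEEE 754 implementation; the function names are made up, and the subnormal case uses the standard binary32 convention of a fixed exponent of -126 with a leading 0.

```python
def classify_binary32(bits: int) -> str:
    """Name the kind of value a 32-bit pattern encodes."""
    exponent_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent_field == 0xFF:                     # biased exponent all ones
        return "nan" if fraction else "infinity"
    if exponent_field == 0:                        # biased exponent all zeros
        return "subnormal" if fraction else "zero"
    return "normal"

def decode_binary32(bits: int) -> float:
    """Compute the value of a finite (zero, subnormal, or normal) pattern."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent_field == 0xFF:
        raise ValueError("infinity and NaN have no finite value")
    if exponent_field == 0:
        # Subnormal (or zero): leading 0 instead of 1, exponent fixed at -126.
        return sign * (fraction / 2**23) * 2.0**-126
    return sign * (1 + fraction / 2**23) * 2.0**(exponent_field - 127)

for pattern in (0x41460000, 0x00000001, 0x80000000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:08x} -> {classify_binary32(pattern)}")

print(decode_binary32(0x41460000))  # 12.375
```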

The Wikipedia articles on single and double precision are excellent, with diagrams and lots of explanation of corner cases and details. See them for the complete details.

Also, some (mostly historical) computers use FP formats that aren't IEEE-754.

And there are other IEEE-754 formats, like 16-bit half-precision, and one notable extended-precision format is 80-bit x87 which stores the leading 1 of the significand explicitly, instead of implied by a zero or non-zero exponent.
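As a small illustration that the other binary formats follow the same scheme, here is the 12.375 example again in 16-bit half precision. The field sizes used here (5 exponent bits, 10 fraction bits, bias 15) are those of IEEE binary16 and are not stated in the answer above; the helper name `half_fields` is made up. Python's `struct` module supports this format through the `e` format character.

```python
import struct

def half_fields(x: float) -> tuple:
    """Split a value, stored as binary16, into (sign, exponent field, fraction field)."""
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    sign = bits >> 15                      # 1 bit
    exponent_field = (bits >> 10) & 0x1F   # 5 bits, biased by 15
    fraction = bits & 0x3FF                # 10 bits, leading 1 implied
    return sign, exponent_field, fraction

sign, exponent_field, fraction = half_fields(12.375)
print(sign, f"{exponent_field:05b}", f"{fraction:010b}")  # 0 10010 1000110000
```

The exponent field is 3 + 15 = 18 and the fraction starts with the same 100011 bits as in the 32-bit example.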

IEEE-754 even defines some decimal floating-point formats, using 10^exp to exactly represent decimal fractions instead of binary fractions. (HW support for these is limited but does exist).

Marco Luzzara answered Mar 16 '23 16:03



This is nothing different from grade-school math. In grade school we first learned to add and subtract positive whole numbers. Then we learned to put a horizontal scratch in front of a number as a minus sign, learned about the number line, and now we could go negative. So the presence or absence of the minus sign (or minus sign vs plus sign) indicates whether an individual number is negative or positive. In binary it takes only one bit to say "I am negative" or "I am positive". That is the "sign" bit in this floating point format (or any other).

Then at some point in grade school, after doing fractions for a while, we learned about decimal points. That was just a period we put between two digits to indicate where the whole number ended and where the fraction started. I could stop there and say there is no reason whatsoever for base 2 to be any different from base 10, base 13, or base 27: you just put a period between two digits to mark where the whole part ends and the fraction begins. But floating point goes a little further. It may have been in grade school or later in middle school, but eventually they taught us scientific notation, a way to represent numbers by moving that decimal point around. The point still marks the boundary between the whole part and the fraction, but off to the side we multiply by the base raised to a power:

12345.67 = 1.234567 * 10^4

And that is the remaining piece of the puzzle. With pencil and paper, so long as we have enough paper and enough pencil lead (graphite), we can write numbers with as many digits as we care to. But as you already know from whole numbers, on a computer we are generally limited by the size of a register. We can use other grade-school knowledge to turn an 8-bit ALU into an ALU of arbitrary width (so long as we have enough bits of memory/storage), but we are still dealing with things 8 bits at a time in that case. For floating point they chose fixed formats, initially 32, 64, and 80 bits (or maybe the 80-bit one came later), so the bits available for these numbers are strictly limited (we now have 16-bit and maybe smaller formats, although smaller would not seem to make much sense). These formats store "something times base to the power exponent". The something is the mantissa, the 1.234567 above, but stored without the decimal point as 1234567; the location of the point is assumed/agreed upon (known): it sits right after the first non-zero digit. So 123456.7 we would move to 1.234567 and adjust the exponent; 78.45 we would move to 7.845 and adjust the exponent on the base multiplier. Since this is binary, there is only one value that is not zero, and that is one (a bit is either 0 or 1), so 011101.000 we move to 1.110100 and adjust the exponent. (This is like scientific notation, but in base 2.)

Next, the number of bits in this mantissa (or significant digits in scientific notation, if you want to think of it that way) is limited by the format: 23 stored bits for single precision. See the Wikipedia page on the single-precision floating-point format (the 32-bit one; double precision is 64 bits and works exactly the same way, just with more mantissa and exponent bits). So we take our number, however many digits it has, find the most significant 1, move the binary point to just after it, and adjust the exponent on the multiplier just like we did above:

11101.01 = 1.110101 * 2^4

We technically don't need to store the 1 before the binary point, and we don't need to store the 2, but we do need to store the 110101 and we need to store the 4 in binary form, along with the sign, which in this case indicates positive. Given the sign, exponent, and mantissa we can reconstruct this number, or any number that conforms, meaning one that is not really small or really big (such that the exponent wouldn't fit in the allotted number of bits).
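That reconstruction can be sketched in a few lines of Python. The function name `reconstruct` and its argument layout are made up for illustration; it takes the sign bit, the plain (not yet encoded) exponent, and the bits after the binary point as a string, and puts the implied leading 1 back.

```python
def reconstruct(sign: int, exponent: int, fraction_bits: str) -> float:
    """Rebuild a value from its sign bit, plain exponent, and the bits after the binary point."""
    significand = 1.0  # the implied leading 1 that was not stored
    for position, bit in enumerate(fraction_bits, start=1):
        if bit == "1":
            significand += 2.0 ** -position  # each bit is worth 1/2, 1/4, 1/8, ...
    value = significand * 2.0 ** exponent
    return -value if sign else value

print(reconstruct(0, 4, "110101"))  # 29.25
```

29.25 is 11101.01 in binary (16 + 8 + 4 + 1 + 1/4), so the pieces really do give back the original number.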

The IEEE-754 folks then took one last step: instead of encoding the exponent as-is, they used a sort of backwards two's complement. We already know two's complement from integer math on computers and how to read those numbers. They didn't do exactly that for some reason, even though it might seem to make more sense. Instead they declared that 1000...0000 (a one followed by all zeros) is roughly the midpoint; another way to look at it is that all zeros is the smallest exponent field, all ones is the largest, and you have to adjust. An 8-bit two's complement number runs from -128 to +127. The biased encoding shifts this the other way: subtracting 127 from the 8-bit field gives -127 to +128, although the two end values (field all zeros and field all ones) are reserved for special cases, so normal numbers use exponents -126 to +127. To us it simply means we encode by adding 127. In my case above, with 2 to the power 4, the binary for 4 is 100; in 8-bit two's complement that is 00000100, and to "encode" it into the single-precision IEEE 754 format it becomes 10000011. I simply added 127 (or add 128 to get 10000100 and then subtract one).
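One commonly given reason for the bias, as opposed to two's complement, is that bit patterns of positive floats then sort in the same order as the numbers themselves. The sketch below (helper names are hypothetical) encodes a few exponents both ways and checks the ordering claim on a handful of sample values; it is a spot check, not a proof.

```python
import struct

BIAS = 127

def biased(exponent: int) -> str:
    """8-bit exponent field as stored in binary32."""
    return format(exponent + BIAS, "08b")

def twos_complement(exponent: int) -> str:
    """What the field would look like as 8-bit two's complement."""
    return format(exponent & 0xFF, "08b")

def bits_of(x: float) -> int:
    """Bit pattern of x stored as a 32-bit float, read as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(biased(4))                                  # 10000011
print(biased(-1) < biased(1))                     # True: smaller exponent, smaller field
print(twos_complement(-1) < twos_complement(1))   # False: 11111111 vs 00000001

# For positive values, sorting by bit pattern matches sorting by value.
values = [0.5, 2.0, 12.375, 29.25, 1e-3, 3e10]
print(sorted(values) == sorted(values, key=bits_of))  # True
```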

So I lied, there are a few more things: the special cases. So far we have one bit for the sign (is this positive or negative), 8 bits for an encoded exponent on the power-of-2 multiplier, and the mantissa or fraction bits holding the significant digits of our number. But what about zero? There is no non-zero bit in zero, so how do we represent that number? That is a special case, almost a hardcoded number: an exponent field of all zeros with a mantissa of all zeros. The format can actually represent both +0 and -0 with different bit patterns. I think later versions of the spec encourage or dictate that math that results in zero is positive, but I don't know that for sure; I have not seen a copy of the spec in many years, as you have to pay for it to get it legally. The other special cases use an exponent field of all ones. With a zero mantissa that is infinity (plus or minus), which is what you get, for example, when you divide a non-zero number by zero, or when a result is so large that it cannot be represented as a number times 2 to the power N because N is too large for the exponent field (larger than +127 for single precision). With a non-zero mantissa they are called NaNs, or not-a-number, and there is more than one NaN, as you can put different patterns in the mantissa. NaNs come from operations with no meaningful result, such as zero divided by zero. At the other end, a number can be too small (the exponent would be smaller than -126). For those, the format has what are called tiny numbers or denormals: they are not 1.xxxx; the leading one is allowed to slip, giving 0.000...1xxxx, which is not a normalized number but lets us go a wee bit smaller than the smallest normal number we can represent. Some FPUs/software don't support denormals.
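A few of these special cases can be poked at directly. The sketch below uses plain Python floats, which on typical platforms are 64-bit doubles rather than 32-bit singles (that choice is mine, for convenience), but the same kinds of special values exist in both formats. As far as I know, in IEEE 754 arithmetic overflow produces infinity and NaN comes from operations like infinity minus infinity. Plain Python raises ZeroDivisionError on division by zero instead of returning a special value, so that case is not shown.

```python
import math
import sys

inf = float("inf")

# Overflow: a result too large for the exponent range becomes infinity.
print(1e308 * 10)                      # inf

# An operation with no meaningful numeric result gives NaN,
# and a NaN does not compare equal even to itself.
nan = inf - inf
print(math.isnan(nan), nan == nan)     # True False

# +0 and -0 compare equal but carry different sign bits.
neg_zero = 0.0 * -1.0
print(neg_zero == 0.0, math.copysign(1.0, neg_zero))  # True -1.0

# Below the smallest normal number, values become subnormal before reaching zero.
smallest_normal = sys.float_info.min
subnormal = smallest_normal / 2
print(0.0 < subnormal < smallest_normal)  # True
```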

Now go to Wikipedia and search for the single-precision floating-point format; that page should now make a lot of sense... I hope.

old_timer answered Mar 16 '23 16:03
