How to normalize a mantissa

I'm trying to convert an int into a custom float, in which the user specifies the number of bits reserved for the exponent and mantissa, but I don't understand how the conversion works. My function takes an int value and an int exp to represent the number (value * 2^exp), e.g. value = 12, exp = 4 represents 192, but I don't understand the process I need to follow to convert these. I've been looking at this for days and playing with IEEE converter web apps, but I just don't understand what the normalization process is. I see that it's "move the binary point and adjust the exponent", but I have no idea what this means. Can anyone give me an example to go off of? Also, I don't understand what the exponent bias is. The only info I have is that you just add a number to your exponent, but I don't understand why. I've been searching Google for an example I can understand, but this just isn't making any sense to me.

748

asked Mar 01 '15 23:03

Tommy K

2 Answers

A floating point number is normalized when we force the integer part of its mantissa to be exactly 1 and allow its fraction part to be whatever we like.

For example, if we were to take the number 13.25, which is 1101.01 in binary, 1101 would be the integer part and 01 would be the fraction part.

I could represent 13.25 as 1101.01*(2^0), but this isn't normalized because the integer part is not 1. However, we are allowed to shift the mantissa to the right one digit (that is, move the binary point one place to the left) if we increase the exponent by 1:

  1101.01*(2^0)
= 110.101*(2^1)
= 11.0101*(2^2)
= 1.10101*(2^3)

This representation 1.10101*(2^3) is the normalized form of 13.25.
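For the integer input described in the question (value * 2^exp), the same shifting can be sketched in C. This is a minimal sketch, not a definitive implementation: the helper names `highest_set_bit` and `normalized_exponent` are made up for illustration, and `value` is assumed to be nonzero and to fit in 32 bits. Finding the position of the leading 1 tells us how many places the binary point has to move left, and that count is added to the exponent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: position of the highest set bit of v,
   counting from 0 at the least significant bit. v must be nonzero. */
static int highest_set_bit(uint32_t v)
{
    assert(v != 0);
    int pos = 0;
    while ((v >>= 1) != 0) {
        pos++;
    }
    return pos;
}

/* For value * 2^exp, return the exponent e of the normalized form
   1.fff... * 2^e. Moving the binary point left by `pos` places is
   balanced by adding `pos` to the exponent. */
static int normalized_exponent(uint32_t value, int exp)
{
    return exp + highest_set_bit(value);
}
```

With 13.25 written as 53 * 2^-2 (53 is 110101 in binary), the leading 1 sits at position 5, so the normalized exponent is -2 + 5 = 3, matching the hand calculation above.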

That said, we know that normalized floating point numbers will always come in the form 1.fffffff * (2^exp)

For efficiency's sake, we don't bother storing the integer part 1 in the binary representation itself; we just pretend it's there. So if we were to give your custom-made float type 5 bits for the mantissa, we would know the bits 10100 would actually stand for 1.10100.
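Dropping the implied 1 can be sketched in C as follows. The function name `mantissa_field` and its parameters are assumptions made for this example, not part of any standard API, and the sketch truncates bits that do not fit instead of rounding them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the stored mantissa field for a nonzero
   integer `value`, using `mant_bits` bits (1..31). The leading 1 is
   dropped (hidden bit) and the remaining bits are left-aligned in the
   field. Bits that do not fit are truncated, not rounded. */
static uint32_t mantissa_field(uint32_t value, int mant_bits)
{
    assert(value != 0);
    assert(mant_bits >= 1 && mant_bits <= 31);

    /* Find the position of the leading 1. */
    int top = 31;
    while (((value >> top) & 1u) == 0) {
        top--;
    }

    /* Clear the leading 1; `top` fraction bits remain. */
    uint32_t frac = value & ~(UINT32_C(1) << top);

    if (top <= mant_bits) {
        return frac << (mant_bits - top);   /* pad with zeros on the right */
    }
    return frac >> (top - mant_bits);       /* drop the low bits */
}
```

For example, 110100 (which is 1.10100 once normalized) with a 5-bit mantissa yields the field 10100, the same bits used in the paragraph above.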

Here is an example with the standard 23-bit mantissa:

[Image not preserved: example bit layout with the standard 23-bit mantissa]

As for the exponent bias, let's take a look at the standard 32-bit float format, which is broken into 3 parts: 1 sign bit, 8 exponent bits, and 23 mantissa bits:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm

The exponent fields 00000000 and 11111111 are reserved for special purposes (00000000 for zero and subnormal numbers, 11111111 for Inf and NaN), so with 8 exponent bits, we could represent 254 different exponents, say 2^1 to 2^254, for example. But what if we want to represent 2^-3? How do we get negative exponents?

The format fixes this problem by automatically subtracting 127 from the stored exponent. Therefore:

0000 0001 would be 1 -127 = -126
0010 1101 would be 45 -127 = -82
0111 1111 would be 127-127 = 0
1001 0010 would be 146-127 = 19

This changes the exponent range from 2^1 ... 2^254 to 2^-126 ... 2^+127 so we can represent negative exponents.
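The subtraction (and the matching addition when storing a number) can be sketched in C. The names `FLOAT32_BIAS`, `decode_exponent`, and `encode_exponent` are invented for this example. The sketch only handles the 254 ordinary exponent fields and rejects the two reserved ones.

```c
#include <assert.h>
#include <stdint.h>

enum { FLOAT32_BIAS = 127 };   /* bias of the 8-bit exponent field */

/* Stored 8-bit field -> actual exponent.
   Fields 0 and 255 are reserved and not handled here. */
static int decode_exponent(uint8_t field)
{
    assert(field != 0 && field != 255);
    return (int)field - FLOAT32_BIAS;
}

/* Actual exponent -> stored 8-bit field (the reverse direction). */
static uint8_t encode_exponent(int exponent)
{
    assert(exponent >= -126 && exponent <= 127);
    return (uint8_t)(exponent + FLOAT32_BIAS);
}
```

Here `decode_exponent(0x7F)` gives 0 and `decode_exponent(0x01)` gives -126, matching the rows above, while `encode_exponent(3)` gives 130 (1000 0010), which is the field needed for 1.10101*(2^3).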

163

answered Oct 02 '22 05:10

eigenchris

Tommy -- chux and eigenchris, along with the others, have provided excellent answers, but if I am looking at your comments correctly, you still seem to be struggling with the nuts-and-bolts of "how would I take this info and then use this in creating a custom float representation where the user specifies the amount of bits for the exponent?" Don't feel bad, it is as clear as mud the first dozen times you go through it. I think I can take a stab at clearing it up.

You are familiar with the IEEE754-Single-Precision-Floating-Point representation of:

IEEE-754 Single Precision Floating Point Representation of (13.25)

  0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 |- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|
 |s|      exp      |                  mantissa                   |

That is the 1-bit sign-bit, 8-bit biased exponent (in 8-bit excess-127 notation), and the remaining 23-bit mantissa.

When you allow the user to choose the number of bits in the exponent, you are going to have to rework the exponent notation to work with the new user-chosen limit.

What will that change?

Will it change the sign-bit handling -- No.
Will it change the mantissa handling -- No (you will still convert the mantissa/significand to "hidden bit" format).

So the only thing you need to focus on is exponent handling.

How would you approach this? Recall that the current 8-bit exponent is in what is called excess-127 notation (where 127 is the largest value that fits in 7 bits, allowing any bias to be contained and expressed within the current 8-bit limit). If your user chooses 6 bits as the exponent size, then what? You will have to provide a similar method to ensure you have a fixed number to represent your new excess-## notation that will work within the user limit.

Take a 6-bit user limit: a reasonable choice for the bias would be 31 (the largest value that can be represented in 5 bits). To that you could apply the same logic (taking the 13.25 example above). Your binary representation for the number is 1101.01, in which you move the binary point 3 positions to the left to get 1.10101, which gives you an unbiased exponent of 3.
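The value 31 follows the same pattern as 127 does for 8 bits: for an exponent field of k bits, the bias is 2^(k-1) - 1. A minimal C sketch of that rule is below. The function names are hypothetical, and treating the all-zeros and all-ones fields as reserved is an assumption carried over from IEEE 754, not something a custom format is required to do.

```c
#include <assert.h>

/* Bias for an exponent field of `exp_bits` bits, following the
   pattern 2^(exp_bits - 1) - 1: 8 bits -> 127, 6 bits -> 31. */
static int exponent_bias(int exp_bits)
{
    assert(exp_bits >= 2 && exp_bits <= 30);
    return (1 << (exp_bits - 1)) - 1;
}

/* Stored field for an actual (unbiased) exponent.
   Assumption: the all-zeros and all-ones fields stay reserved,
   as they are in IEEE 754. */
static unsigned biased_exponent(int exponent, int exp_bits)
{
    int field = exponent + exponent_bias(exp_bits);
    int max_field = (1 << exp_bits) - 2;
    assert(field >= 1 && field <= max_field);
    return (unsigned)field;
}
```

With a 6-bit field, `biased_exponent(3, 6)` is 3 + 31 = 34, which is 100010 in binary.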

In your 6-bit exponent case you would add 3 + 31 = 34 to obtain your excess-31 notation for the exponent: 100010, then put the mantissa in "hidden bit" format (i.e. drop the leading 1 from 1.10101), resulting in your new custom Tommy Precision Representation:

IEEE-754 Tommy Precision Floating Point Representation of (13.25)

  0 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 |- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|
 |s|    exp    |                    mantissa                     |

With 1-bit sign-bit, 6-bit biased exponent (in 6-bit excess-31 notation), and the remaining 25-bit mantissa.
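Putting the pieces together, the whole conversion from the question's (value, exp) pair can be sketched in C. Everything here is a sketch under stated assumptions: `encode_custom` is a made-up name, the result is packed into a 32-bit word as 1 sign bit, `exp_bits` exponent bits, and 31 - `exp_bits` mantissa bits, `value` is positive, and there is no rounding and no handling of zero, subnormals, Inf, or NaN.

```c
#include <assert.h>
#include <stdint.h>

/* Encode value * 2^exp (value > 0) into a 32-bit word laid out as
   1 sign bit, exp_bits exponent bits, 31 - exp_bits mantissa bits.
   Hypothetical sketch: truncates instead of rounding and does not
   handle zero, subnormals, Inf, or NaN. */
static uint32_t encode_custom(uint32_t value, int exp, int exp_bits)
{
    assert(value != 0);
    assert(exp_bits >= 2 && exp_bits <= 30);
    int mant_bits = 31 - exp_bits;
    int bias = (1 << (exp_bits - 1)) - 1;

    /* Normalize: find the leading 1 and fold its position into exp. */
    int top = 31;
    while (((value >> top) & 1u) == 0) {
        top--;
    }
    int field = exp + top + bias;            /* biased exponent */
    assert(field >= 1 && field <= (1 << exp_bits) - 2);

    /* Hidden bit: drop the leading 1, left-align what remains. */
    uint32_t frac = value & ~(UINT32_C(1) << top);
    if (top <= mant_bits) {
        frac <<= (mant_bits - top);
    } else {
        frac >>= (top - mant_bits);
    }

    /* The sign bit stays 0 because value is unsigned. */
    return ((uint32_t)field << mant_bits) | frac;
}
```

Writing 13.25 as 53 * 2^-2, `encode_custom(53, -2, 6)` produces 0x45500000, the same bit pattern as the 6-bit-exponent diagram above, and `encode_custom(53, -2, 8)` produces 0x41540000, the standard single-precision pattern.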

The same rules would apply to reversing the process to get your floating point number back from the above notation. (just using 31 instead of 127 to back the bias out of the exponent)
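The reverse direction can be sketched the same way. The name `decode_custom` and the 32-bit layout (1 sign bit, `exp_bits` exponent bits, the rest mantissa) are assumptions for this example. The exponent width is capped at 11 bits so that every result fits in a C `double`, and the reserved all-zeros and all-ones exponent fields are rejected.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 32-bit word laid out as 1 sign bit, exp_bits exponent
   bits, 31 - exp_bits mantissa bits. Hypothetical sketch: exp_bits
   is limited to 11 so the result always fits in a double, and the
   reserved exponent fields (all zeros, all ones) are not handled. */
static double decode_custom(uint32_t bits, int exp_bits)
{
    assert(exp_bits >= 2 && exp_bits <= 11);
    int mant_bits = 31 - exp_bits;
    int bias = (1 << (exp_bits - 1)) - 1;

    uint32_t sign  = bits >> 31;
    uint32_t field = (bits >> mant_bits) & ((UINT32_C(1) << exp_bits) - 1);
    uint32_t frac  = bits & ((UINT32_C(1) << mant_bits) - 1);
    assert(field != 0 && field != (UINT32_C(1) << exp_bits) - 1);

    /* Restore the hidden 1 in front of the fraction bits. */
    double result = 1.0 + (double)frac / (double)(UINT32_C(1) << mant_bits);

    /* Back the bias out of the exponent, then scale by 2^e. */
    int e = (int)field - bias;
    while (e > 0) { result *= 2.0; e--; }
    while (e < 0) { result /= 2.0; e++; }

    return sign ? -result : result;
}
```

For the 6-bit example, `decode_custom(0x45500000, 6)` backs 31 out of the stored 34 to get an exponent of 3 and returns 1.65625 * 2^3 = 13.25.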

Hopefully this helps in some way. I don't see much else you can do if you are truly going to allow for a user-selected exponent size. Remember, the IEEE-754 standard wasn't something that was guessed at; a lot of good reasoning and trade-offs went into arriving at the 1-8-23 sign-exponent-mantissa layout. However, I think your exercise does a great job at requiring you to firmly understand the standard.

Not addressed in this discussion is what effect this would have on the range of numbers that could be represented in this Custom Precision Floating Point Representation. I haven't looked at it, but the primary limitation would seem to be a reduction in the MAX/MIN magnitude that could be represented, since fewer exponent bits means a narrower range of exponents.
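As a rough way to see the effect on range, the span of usable exponents can be computed from the exponent width. This is a sketch under the assumption that the custom format reserves the all-zeros and all-ones exponent fields the way IEEE 754 does. The struct and function names are invented for illustration.

```c
#include <assert.h>

/* Smallest and largest actual exponents of normalized numbers. */
struct exponent_range {
    int min;
    int max;
};

/* Assumes the bias is 2^(exp_bits - 1) - 1 and that the all-zeros
   and all-ones fields are reserved, as in IEEE 754. */
static struct exponent_range exponent_range_for(int exp_bits)
{
    assert(exp_bits >= 2 && exp_bits <= 30);
    int bias = (1 << (exp_bits - 1)) - 1;

    struct exponent_range r;
    r.min = 1 - bias;                      /* smallest usable field is 1 */
    r.max = ((1 << exp_bits) - 2) - bias;  /* largest usable field */
    return r;
}
```

With 8 exponent bits this gives -126 to 127. With 6 bits it gives -30 to 31, so the largest normalized magnitude drops from just under 2^128 to just under 2^32.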

answered Oct 02 '22 07:10

David C. Rankin

Tags:

c

floating-point

double

normalization