performing floating point addition algorithmically

I'm trying to understand the algorithm for floating point addition. In the past I've only had to do this on paper, and did it by converting to decimal and back again. I am writing a floating point ALU in an HDL, so that won't work in this case. I've read a lot of the questions on the topic, the most useful of which I've used for this example, and read many articles, but some concepts elude me. I've written the questions in context below, but for a summary, here they are up top:

  1. When is the implicit bit in the mantissa 0, and when is it 1?
  2. After the addition, how do we algorithmically check for normalization, and then determine which way to shift?
  3. If one of the numbers is negative, is the subtraction of the mantissa performed in 2's complement or not?

Borrowing from this example:

00001000111100110110010010011100 (1.46487e-33)
00000000000011000111111010000100 (1.14741e-39)

First split them into their components (sign, exp, mantissa)

0 00010001 11100110110010010011100
0 00000000 00011000111111010000100
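
Just for illustration (my actual target is an HDL), that split could be written in C++ roughly like this; the struct and field names here are my own:

#include <cstdint>

// split a single-precision bit pattern into sign, biased exponent and fraction fields
struct Fields { uint32_t sign, exp, man; };

Fields split(uint32_t bits)
{
    Fields f;
    f.sign = (bits >> 31) & 0x1u;       // 1 sign bit
    f.exp  = (bits >> 23) & 0xFFu;      // 8 biased exponent bits
    f.man  =  bits        & 0x7FFFFFu;  // 23 fraction (mantissa) bits
    return f;
}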

Next tack on their implicit integer value

0 00010001 1.11100110110010010011100
0 00000000 0.00011000111111010000100

Question 1: Is the reason for the zero integer in front of the 2nd value that the exponent is zero?

Next, subtract the lesser exponent from the greater and shift the lesser mantissa right by that amount

  00010001
- 00000000
___________
00010001 = 17

0.00000000000000000000110
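
As a rough C++ sketch of that alignment step (the helper name is my own):

#include <cstdint>

// shift the mantissa of the operand with the lesser exponent right
// by the exponent difference, so both mantissas share the greater exponent
uint32_t align(uint32_t man_lesser, uint32_t exp_greater, uint32_t exp_lesser)
{
    return man_lesser >> (exp_greater - exp_lesser);   // 17 places in this example
}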

Add the mantissas

   0.00000000000000000000110
+  1.11100110110010010011100
______________________________
   1.11100110110010010100010

Question 2: In this case the MSB is 1, so the value is normalized and we can drop it. Suppose that it weren't. If the MSB were 0, would that still be considered a normalized value, or would we shift left to get a 1 in that place?

Question 3: Suppose one of the numbers was negative. Is subtraction performed in 2's complement, or is it enough to simply subtract the mantissas as they are?

asked Nov 11 '21 by richbai90

People also ask

How are addition and subtraction operations performed on floating point numbers?

The addition or subtraction is done by the 2's complement method. Thus a comparator is used to detect the smaller mantissa for inversion. The leading-zero counter is for normalizing the result in the case of a subtraction operation, when the mantissa part contains leading zeros.

How do floating point operations perform?

Floating-point operations involve floating-point numbers and typically take longer to execute than simple binary integer operations. For this reason, most embedded applications avoid widespread usage of floating-point math in favor of faster, smaller integer operations.

Is addition a floating point operation?

Arithmetic operations on floating point numbers consist of addition, subtraction, multiplication and division. The operations are done with algorithms similar to those used on sign-magnitude integers (because of the similarity of representation); for example, only add numbers of the same sign.

How do you add and subtract floating point numbers?

Let's look at floating-point arithmetic. When we add or subtract floating point numbers, we must first align the floating points. Let's look at an example of adding 319.319 and 429.429. Step 1: Represent the first number with an exponent. Step 2: Represent the second number with the same exponent as the first number.

What is floating-point addition?

Floating-point arithmetic is usually done in hardware to make it fast. This hardware, called the floating-point unit (FPU), is typically distinct from the central processing unit (CPU).

How is the floating-point arithmetic unit implemented?

The floating-point arithmetic unit is implemented by two loosely coupled fixed point datapath units, one for the exponent and the other for the mantissa. One such basic implementation is shown in figure 10.2.


2 Answers

When is the implicit bit in the mantissa 0, and when is it 1?

When the (biased) exponent is at the minimum value (e.g. 0), the implicit bit is 0.
When the (biased) exponent is at the maximum value, there is no implicit bit. The value is infinity or NaN.
Otherwise the implicit bit is 1.
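
As a rough single-precision sketch of those rules (the helper name is an invention for illustration):

#include <cstdint>

// implicit bit of a binary32, given its 8-bit biased exponent field
uint32_t implicit_bit(uint32_t biased_exp)
{
    if (biased_exp == 0)    return 0;  // minimum: zero or subnormal
    if (biased_exp == 0xFF) return 0;  // maximum: infinity or NaN - no implicit bit applies
    return 1;                          // otherwise: normal number
}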


After the addition, how do we algorithmically check for normalization, and then determine which way to shift?

With addition (of 2 operands with the same sign), if there is a carry out of the most-significant place of the sum, shift right, increment the exponent. Check for exponent overflow.

With addition (of 2 operands with opposite signs) - which is effectively subtraction - if all significand bits are zero, return zero. Otherwise, if the most-significant place is zero, repeatedly shift left as needed, decrementing the exponent each time, but never decrementing the exponent lower than the minimum.
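
A rough sketch of that check, assuming 24-bit significands added into a wider unsigned integer (the variable names are mine):

#include <cstdint>

void normalize(uint32_t &sum, uint32_t &exp)   // sum: raw mantissa result, exp: biased exponent
{
    if (sum & (1u << 24)) {                    // carry out of the top bit
        sum >>= 1;                             // shift right once
        exp += 1;                              // and increment; check exp for overflow
    } else if (sum == 0) {
        exp = 0;                               // exact cancellation: return zero
    } else {
        while (!(sum & (1u << 23)) && exp > 1) {
            sum <<= 1;                         // shift left until the implicit-1 position is set
            exp -= 1;                          // never going below the minimum exponent
        }
        if (!(sum & (1u << 23))) exp = 0;      // could not normalize: result is subnormal
    }
}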


If one of the numbers is negative, is the subtraction of the mantissa performed in 2's complement or not?

No. Common FP encoding is sign-magnitude.


Question 1: Is the reason for the zero integer in front of the 2nd value that the exponent is zero?

Yes, the biased exponent is at minimum.


Question 2: In this case the MSB is 1, so the value is normalized and we can drop it. Suppose that it weren't. If the MSB were 0, would that still be considered a normalized value, or would we shift left to get a 1 in that place?

The MSB might be zero when the signs differ (or both operands are 0.0). If the sum is not zero, shift left as described above.


Question 3: Suppose one of the numbers was negative. Is subtraction performed in 2's complement, or is it enough to simply subtract the mantissas as they are?

2's complement is not used. When the signs are the same, add magnitudes. When the signs differ, flip the 2nd one's sign bit and call your subtraction code.
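
For example, a sketch of that dispatch (the struct and the add_mag()/sub_mag() helpers are hypothetical):

struct FP { unsigned sign, exp, man; };   // dissected sign-magnitude operand

FP add_mag(FP a, FP b);   // hypothetical: adds aligned magnitudes, sets the result sign
FP sub_mag(FP a, FP b);   // hypothetical: subtracts aligned magnitudes, sets the result sign

FP fp_add(FP a, FP b)
{
    if (a.sign == b.sign)
        return add_mag(a, b);   // same signs: add magnitudes, keep the common sign
    b.sign ^= 1;                // otherwise flip the 2nd operand's sign bit
    return sub_mag(a, b);       // and reuse the subtraction code
}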


Note: IEEE-754 does not use the term "mantissa"; both the spec and Wikipedia call it the significand.

answered Oct 12 '22 by chux - Reinstate Monica

Answer 1

See the small C++/VCL example of dissecting 32 and 64 bit floats for how to deal with the normalized/denormalized and zero/inf/NaN states of floats... The state is defined by the combination of the exponent and mantissa values.

Answer 2

No, you do not shift so that a 1 ends up in the first place before the decimal point. Instead you shift by the difference of the exponents between the operands. Also, the ALU operations on the mantissas are usually done at a bigger mantissa bitwidth to lower rounding errors... only the result is truncated back to the original mantissa bitwidth after normalization.
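
For example, a small sketch of that widening (the 3 extra low bits and the names are just my choice):

#include <cstdint>

// work on mantissas widened by 3 extra low bits, truncate back only at the end
uint32_t add_widened(uint32_t a_man, uint32_t b_man)
{
    uint32_t wa = a_man << 3;    // widen both (already aligned) mantissas
    uint32_t wb = b_man << 3;
    uint32_t ws = wa + wb;       // the ALU work happens at the wider bitwidth
    // ... normalization of ws and the exponent would go here ...
    return ws >> 3;              // truncate back to the original mantissa bitwidth
}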

Answer 3

Yes, you can also use 2's complement, so for c = a + b you can do it, for example, like this in C++ (using the already dissected parts of the operands):

// dissect a,b into their components and add the implicit 1 to the mantissas if needed
// a = (-1)^a.sig * a.man * 2^a.exp;
// b = (-1)^b.sig * b.man * 2^b.exp;

// here you should handle the special cases when the operands are (+/-)inf or NaN

if (a.sig) a.man=-a.man; // convert the mantissas to 2's complement
if (b.sig) b.man=-b.man;

sh=a.exp-b.exp;     // exponent difference
if (sh>=0)          // shift the operand with the lesser exponent right
   {                // (arithmetic shift, so the sign of the 2's complement mantissa is kept)
   b.man>>=sh;
   c.exp=a.exp;
   }
else
   {
   a.man>>=-sh;
   c.exp=b.exp;
   }
c.man=a.man+b.man;  // 2's complement addition
c.sig=0;
if (c.man<0){ c.sig=1; c.man=-c.man; } // convert back to an unsigned mantissa

// here you should normalize c.exp,c.man and remove the implicit 1 from the mantissa
// and reconstruct the result float
// c = (-1)^c.sig * c.man * 2^c.exp;

You can also do this on an unsigned ALU, however then you need to sort the operands by sign and absolute value, which is much more work...

answered Oct 12 '22 by Spektre