I'm trying to understand the algorithm for floating point addition. In the past I've only had to do this on paper, and did it by converting it to decimal and back again. I am writing a floating point ALU in an HDL so that won't work in this case. I've read a lot of the questions on the topic, the most useful of which I've used for this example, and read many articles, but some concepts elude me. I've written the questions in context below, but for summary here they are at top:
Borrowing from this example:
00001000111100110110010010011100 (1.46487e-33)
00000000000011000111111010000100 (1.14741e-39)
First split them into their components (sign, exp, mantissa)
0 00010001 11100110110010010011100
0 00000000 00011000111111010000100
Next, attach the implicit integer bit
0 00010001 1.11100110110010010011100
0 00000000 0.00011000111111010000100
Question 1: Is the reason for the zero integer in front of the 2nd value that its exponent is zero?
Next, subtract the lesser exponent from the greater and shift the mantissa of the number with the lesser exponent right by that amount
00010001
- 00000000
___________
00010001 = 17
0.00000000000000000000110
Add the mantissas
0.00000000000000000000110
+ 1.11100110110010010011100
______________________________
1.11100110110010010100010
Question 2: In this case the MSB is 1, so the value is normalized and we can drop it. Suppose that it weren't. If the MSB were 0, would that still be considered a normalized value, or would we shift left to get a 1 in that place?
Question 3: Suppose one of the numbers were negative. Is the subtraction performed in 2's complement, or is it enough to simply subtract the mantissas as they are?
The addition or subtraction is done by the 2's-complement method. Thus a comparator is used to detect the smaller mantissa for inversion. The leading-zero counter is for normalizing the result of a subtraction, where the mantissa may contain leading zeros.
Floating-point operations involve floating-point numbers and typically take longer to execute than simple binary integer operations. For this reason, most embedded applications avoid widespread use of floating-point math in favor of faster, smaller integer operations.
Arithmetic operations on floating-point numbers consist of addition, subtraction, multiplication and division. The operations are done with algorithms similar to those used on sign-magnitude integers (because of the similarity of representation) — for example, magnitudes are only added directly when the numbers have the same sign.
Let's look at floating-point arithmetic. When we add or subtract floating-point numbers, we must first align the radix points. Let's look at an example of adding 319319.319 and 429.429429. Step 1: Represent the first number with an exponent. Step 2: Represent the second number with the same exponent as the first number.
Floating-point addition: floating-point arithmetic is usually done in hardware to make it fast. This hardware, called the floating-point unit (FPU), is typically distinct from the central processing unit (CPU).
The floating-point arithmetic unit is implemented by two loosely coupled fixed point datapath units, one for the exponent and the other for the mantissa. One such basic implementation is shown in figure 10.2.
Doing it in binary is similar.
When is the implicit bit in the mantissa 0, and when is it 1?
When the (biased) exponent is at the minimum value (i.e. 0), the implicit bit is 0.
When the (biased) exponent is at the maximum value, there is no implicit bit. The value is infinity or NaN.
Otherwise the implicit bit is 1.
After the addition, how do we algorithmically check for normalization, and then determine which way to shift?
With addition (of 2 operands with the same sign), if there is a carry out of the most-significant place of the sum, shift right, increment the exponent. Check for exponent overflow.
With addition of 2 operands with opposite signs (which is effectively subtraction): if all significant bits are zero, return zero. Otherwise, if the most-significant place is zero, shift left repeatedly as needed, decrementing the exponent each time, except do not decrement the exponent below its minimum.
If one of the numbers is negative, is the subtraction of the mantissa performed in 2's complement or not?
No. Common FP encoding is sign-magnitude.
Question 1: Is the reason for the zero integer in front of the 2nd value that its exponent is zero?
Yes, the biased exponent is at minimum.
Question 2: In this case the MSB is 1, so the value is normalized and we can drop it. Suppose that it weren't. If the MSB were 0, would that still be considered a normalized value, or would we shift left to get a 1 in that place?
The MSB might be zero when the signs differ (or when both operands are 0.0). If the sum is not zero, shift left as described above.
Question 3: Suppose one of the numbers were negative. Is the subtraction performed in 2's complement, or is it enough to simply subtract the mantissas as they are?
2's complement is not used. When the signs are the same, add the magnitudes. When the signs differ, flip the 2nd operand's sign bit and call your subtraction code.
IEEE-754 does not use "mantissa" but "significand", per the wiki. I thought it was "significand" per the spec as well; I'll review later.
Answer1
See the small C++/VCL example of dissecting 32- and 64-bit floats for how to deal with the normalized/denormalized and zero/inf/NaN states of floats. The state is defined by the combination of the exponent and mantissa values.
Answer2
No, you do not shift so that a 1 lands in the first place before the decimal point. Instead you shift by the difference of the operands' exponents. Also, the ALU operations on mantissas are usually done at a bigger mantissa bit width to lower rounding errors; only the result is truncated to the original mantissa bit width after normalization.
Answer3
Yes, you can also use 2's complement, so for c = a + b
you can do it, for example, like this in C++ (using the already dissected parts of the operands):
// dissect a,b into their components and attach the implicit 1 to the mantissas if needed
// a = (-1)^a.sig * a.man * 2^a.exp;
// b = (-1)^b.sig * b.man * 2^b.exp;
// here you should handle the special cases where an operand is (+/-)inf or NaN
if (a.sig) a.man=-a.man; // convert mantissas to 2's complement
if (b.sig) b.man=-b.man;
sh=a.exp-b.exp; // exponent difference
if (sh>=0) // shift the operand with the smaller exponent right to align
{
b.man>>=sh;
c.exp=a.exp;
}
else
{
a.man>>=-sh;
c.exp=b.exp;
}
c.man=a.man+b.man; // 2's complement addition
c.sig=0;
if (c.man<0){ c.sig=1; c.man=-c.man; } // convert back to an unsigned mantissa
// here you should normalize c.exp,c.man and remove the implicit 1 from the mantissa
// then reconstruct the result float
// c = (-1)^c.sig * c.man * 2^c.exp;
You can do this on an unsigned ALU as well, but then you need to sort the operands by sign and absolute value, which is much more work...