As a learning project I am implementing floating point operations (add, sub, mul, div) in software using C++. The goal is to become more comfortable with the underlying details of floating point behavior.
I am trying to match my processor's operations to the exact bit, i.e. the IEEE 754 standard. So far it has been working great: add, sub and mul behave perfectly. I tested them on around 110 million random operations and got exactly the same results the processor produces in hardware (although I did not take edge cases, overflow, etc. into account).
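To be concrete about what "matching to the exact bit" means, the check is roughly the sketch below (soft_add is just a stand-in for my implementation; here it defers to the hardware so the snippet is self-contained):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>

// Stand-in for the software implementation under test; here it simply defers
// to the hardware so this sketch compiles on its own.
float soft_add(float a, float b) { return a + b; }

// Reinterpret a float's bit pattern as a 32-bit integer (well defined,
// unlike a pointer cast).
std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> dist(0.001f, 10000.0f);
    long mismatches = 0;
    for (long i = 0; i < 1000000; ++i) {
        float a = dist(rng);
        float b = dist(rng);
        if (bits_of(a + b) != bits_of(soft_add(a, b)))
            ++mismatches;
    }
    std::printf("%ld mismatches\n", mismatches);
    return 0;
}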
After that, I started moving to the last operation, division. It works fine and achieves the wanted result, but from time to time I get the last mantissa bit wrong: it is not rounded up. I am having a bit of a hard time understanding why. The main reference I have been using is the great talk from John Farrier (the timestamp is at the point where it shows how to round):
https://youtu.be/k12BJGSc2Nc?t=1153
That rounding has been working really well for all the operations but is giving me trouble with the division. Let me give you a specific example. I am trying to divide 645.68011474609375 by 493.20962524414063.
The final result I get is:
mine : 0-01111111-01001111001000111100000
c++_ : 0-01111111-01001111001000111100001
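For completeness, a small helper along these lines is enough to dump a result in that sign-exponent-mantissa form (this is a sketch, not the code from my gist):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Print a float as sign-exponent-mantissa bit fields, e.g.
// 0-01111111-01001111001000111100001
void print_fields(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    std::printf("%u-", (unsigned)(u >> 31));
    for (int i = 30; i >= 23; --i) std::printf("%u", (unsigned)((u >> i) & 1u));
    std::printf("-");
    for (int i = 22; i >= 0; --i) std::printf("%u", (unsigned)((u >> i) & 1u));
    std::printf("\n");
}

int main() {
    // the hardware result quoted above
    print_fields(645.68011474609375f / 493.20962524414063f);
    return 0;
}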
As you can see in the two bit patterns above, everything matches except for the last bit. The way I am computing the division is based on this video: https://www.youtube.com/watch?v=fi8A4zz1d-s
Following this, I compute 28 bits of accuracy: 24 bits of mantissa (hidden one + 23 mantissa bits), 3 bits for guard, round and sticky, plus an extra bit for the possible shift. Using the algorithm from the video, I can get a normalization shift of at most 1; that is why I keep an extra bit at the end, so that if it gets shifted in during normalization it is still available for the rounding. Now here is the result I get from the division algorithm:
010100111100100011110000 0100
------------------------ ----
^                        grs^
|__ to be normalized        |____ extra bit
As you can see I get a 0 in the 24th position, so I will need to shift left by one to get the correct normalization. This means I will get:
10100111100100011110000 100
Based on John Farrier's video, in the case of 100 for the grs bits, I only round up if the LSB of the mantissa is a 1. In my case it is a zero, and that is why I do not round up my result.
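To spell the rule out, the decision I am applying looks roughly like this (a simplified sketch, not the code from my gist; sig holds the 24-bit normalized significand and guard/round/sticky are the three bits after it):

#include <cstdint>

// Round-to-nearest-even on a 24-bit significand with guard/round/sticky bits.
// Returns the rounded significand; a carry out of bit 23 would still have to
// be handled by the caller (shift right and bump the exponent).
std::uint32_t round_nearest_even(std::uint32_t sig,
                                 bool guard, bool round, bool sticky) {
    bool lsb = (sig & 1u) != 0;
    if (guard && (round || sticky))          // more than halfway: round up
        return sig + 1u;
    if (guard && !round && !sticky && lsb)   // exactly halfway: round to even
        return sig + 1u;
    return sig;                              // below halfway: truncate
}

With 100 for the grs bits and an LSB of 0 this returns sig unchanged, which matches the result I am getting.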
The reason I am a bit lost is that I am sure my algorithm is computing the right mantissa: I have double checked it with online calculators, and the rounding strategy works for all the other operations. Also, computing it this way triggers the normalization, which yields the correct exponent in the end.
Am I missing something? A small detail somewhere?
One thing that strikes me as odd is the sticky bits: in addition and multiplication you get varying amounts of shifting, which gives the sticky bits a higher chance of being set, whereas here I shift by at most one, which makes the sticky bits not really sticky.
I do hope I gave enough details to make my problem understood. At the bottom of the link below you can find my division implementation; it is a bit cluttered with prints I am using for debugging, but it should give an idea of what I am doing. The code starts at line 374:
https://gist.github.com/giordi91/1388504fadcf94b3f6f42103dfd1f938
PS: meanwhile I am going through "What Every Computer Scientist Should Know About Floating-Point Arithmetic" to see if I missed something.
The result you get from the division algorithm is inadequate. You show:
010100111100100011110000 0100
------------------------ ----
^                        grs^
|__ to be normalized        |____ extra bit
The mathematically exact quotient continues:
010100111100100011110000 0100 110000111100100100011110…
Thus, the residue at the point where you are rounding exceeds ½ ULP, so it should be rounded up. I did not study your code in detail, but it looks like you may have just calculated an extra bit or two of the significand¹. You actually need to know that the residue is non-zero, not just whether its next bit or two is zero. The final sticky bit should be one if any of the bits at or beyond that position in the exact mathematical result would be non-zero (see the sketch below).
¹ “Significand” is the preferred term. “Mantissa” is a legacy term for the fraction portion of a logarithm. The significand of a floating-point value is linear. A mantissa is logarithmic.
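A rough sketch of the idea (this is not your code; just a plain restoring division of the two 24-bit significands in which the sticky bit is taken from the remainder, so it covers every bit of the exact quotient beyond the round bit):

#include <cstdint>

// Divide two 24-bit significands (hidden one included, both representing
// values in [1, 2)). Produces a normalized 24-bit quotient significand plus
// guard, round and sticky bits. The key line is the last one: the sticky bit
// is derived from the remainder, so it is 1 whenever any bit of the exact
// quotient beyond the round bit would be non-zero.
void divide_significands(std::uint32_t n, std::uint32_t d,
                         std::uint32_t &sig24, bool &guard, bool &round,
                         bool &sticky, int &exp_adjust) {
    std::uint64_t rem = n;
    std::uint64_t q = 0;
    int steps;

    if (rem >= d) {            // quotient in [1, 2): integer bit is 1
        rem -= d; q = 1;
        exp_adjust = 0;
        steps = 25;            // 23 more significand bits + guard + round
    } else {                   // quotient in (1/2, 1): exponent drops by one
        exp_adjust = -1;
        steps = 26;            // first generated bit becomes the leading 1
    }

    for (int i = 0; i < steps; ++i) {   // one quotient bit per step
        q <<= 1;
        rem <<= 1;
        if (rem >= d) { rem -= d; q |= 1u; }
    }

    sig24  = static_cast<std::uint32_t>(q >> 2);  // normalized 24-bit significand
    guard  = ((q >> 1) & 1u) != 0;
    round  = (q & 1u) != 0;
    sticky = (rem != 0);       // non-zero remainder = non-zero bits beyond
}

With the inputs from your example the remainder is non-zero at that point, so the sticky bit is 1 and the result is rounded up regardless of the LSB.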