
IEEE-754 floating-point precision: How much error is allowed?

I'm working on porting the sqrt function (for 64-bit doubles) from fdlibm to a model-checking tool I'm using at the moment (cbmc).
While doing so, I read a lot about the IEEE-754 standard, but I don't think I understood its precision guarantees for the basic operations (including sqrt).

Testing my port of fdlibm's sqrt, I got the following calculation with sqrt on a 64-bit double:

sqrt(1977061516825203605555216616167125005658976571589721139027150498657494589171970335387417823661417383745964289845929120708819092392090053015474001800648403714048.0) = 44464159913633855548904943164666890000299422761159637702558734139742800916250624.0

(this case broke a simple post-condition in my test regarding precision; I'm no longer sure whether such a post-condition is even achievable under IEEE-754)

For a comparison, several multi-precision tools calculated something like:

sqrt(1977061516825203605555216616167125005658976571589721139027150498657494589171970335387417823661417383745964289845929120708819092392090053015474001800648403714048.0) = 44464159913633852501611468455197640079591886932526256694498106717014555047373210.… (truncated)

One can see that the 17th digit from the left differs, which amounts to a difference of:

3047293474709469249920707535828633381008060627422728245868877413.0

Question 1: Is this huge amount of error allowed?

The standard says that every basic operation (+, -, *, /, sqrt) should be within 0.5 ulp, meaning that it should be equal to the mathematically exact result rounded to the nearest floating-point representation (Wikipedia says that some libraries guarantee only 1 ulp, but that isn't important at the moment).

Question 2: Does that mean that every basic operation should have an error < 2.220446e-16 (machine epsilon) with 64-bit doubles?

I calculated the same thing on an x86-32 Linux system (glibc/eglibc) and got the same result as with fdlibm, which makes me think that:

  • a: I did something wrong (but how? printf would be a candidate, but I don't know if that could be the reason)
  • b: this error/precision is common in these libraries
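To rule out option (a), one can check with exact rational arithmetic whether the returned value is the correctly rounded square root, i.e. whether the true root lies between the midpoints separating the result from its two neighbouring doubles. A quick sanity-check sketch (assuming Python 3.9+ for math.nextafter; not part of fdlibm):

```python
import math
from fractions import Fraction

x = 1977061516825203605555216616167125005658976571589721139027150498657494589171970335387417823661417383745964289845929120708819092392090053015474001800648403714048.0
y = math.sqrt(x)  # delegates to the platform's sqrt

# y is the correctly rounded root iff sqrt(x) lies between the midpoints
# separating y from its two neighbouring doubles (ties covered by <=).
yf = Fraction(y)
lo = Fraction(math.nextafter(y, 0.0))       # next double towards zero
hi = Fraction(math.nextafter(y, math.inf))  # next double towards +inf

assert ((yf + lo) / 2) ** 2 <= Fraction(x) <= ((yf + hi) / 2) ** 2
print("sqrt(x) is correctly rounded")
```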
asked Nov 30 '10 20:11 by sascha


1 Answer

The IEEE-754 standard requires that the so-called "basic operations" (which include addition, subtraction, multiplication, division, and square root) be correctly rounded. This means that there is a unique allowed answer, and it is the representable floating-point number closest to the so-called "infinitely precise" result of the operation.

In double precision, numbers have 53 bits of significand precision, so the correct answer is the exact result rounded to 53 significant bits. As Rick Regan showed in his answer, this is exactly the result that you got.

The answers to your questions are:

Question 1: Is this huge amount of error allowed?

Yes, but it is quite misleading to call this error "huge". The fact is that there is no double-precision value that could be returned that would have a smaller error.
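For context, the gap between adjacent doubles (one ulp) at this magnitude is about 6.6e63, so the observed difference is actually below half an ulp. A quick check in Python (3.9+ for math.ulp), using the two decimal values from the question:

```python
import math

y = 44464159913633855548904943164666890000299422761159637702558734139742800916250624.0
diff = 3047293474709469249920707535828633381008060627422728245868877413.0

ulp = math.ulp(y)        # spacing between y and the next larger double: 2**212 here
print(diff < 0.5 * ulp)  # True: the "huge" absolute error is under half an ulp
```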

Question 2: Does that mean, that every basic operation should have an error < 2.220446e-16 with 64-bit doubles (machine-epsilon)?

No. It means that every basic operation should be rounded to the (unique) closest representable floating-point number according to the current rounding mode. This is not quite the same as saying that the relative error is bounded by machine epsilon.
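To make that concrete: for a normal result, correct rounding bounds the *relative* error by 2^-53 ≈ 1.11e-16 (half the machine epsilon), and the example value meets that bound. A sketch using exact integer arithmetic (Python chosen just for its big-integer support):

```python
import math
from fractions import Fraction

x = 1977061516825203605555216616167125005658976571589721139027150498657494589171970335387417823661417383745964289845929120708819092392090053015474001800648403714048.0
y = math.sqrt(x)

# floor(sqrt(x) * 2**100): the true root computed with 100 extra guard bits
r = math.isqrt(int(x) << 200)
true_root = Fraction(r, 1 << 100)   # undershoots sqrt(x) by less than 2**-100

rel_err = abs(Fraction(y) - true_root) / true_root
print(float(rel_err))                  # about 6.9e-17 on this input
assert rel_err < Fraction(1, 2 ** 53)  # within half the machine epsilon
```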

Question 3: Which result do you obtain with your x86 hardware and gcc + libc?

The same answer you did, because sqrt is correctly rounded on any reasonable platform.

answered Sep 23 '22 01:09 by Stephen Canon