Generally the formula for an exponential moving average is given as:

    avg = a * x + (1 - a) * avg

but while implementing it, just to save one floating point op, I do it as:

    avg = avg + a * (x - avg)
How much does this affect precision? Or is it drastically wrong to do it this way? I know I may be paranoid about saving just one FP op, and I am ready to implement it the theoretical way, but I would still like to understand this. Any details or examples you can provide would be great. Thanks.
EDIT: Of course I understand that in the second way I will lose precision if I subtract two very close numbers in FP, but is that the only reason for implementing it the first way?
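In code, the two variants look roughly like this (a minimal C sketch; the function and variable names are just illustrative):

    /* Theoretical form: one subtraction, two multiplications, one addition. */
    double ema_theoretical(double avg, double x, double a)
    {
        return a * x + (1 - a) * avg;
    }

    /* Shortcut form: one subtraction, one multiplication, one addition --
       one FP op fewer per update. */
    double ema_shortcut(double avg, double x, double a)
    {
        return avg + a * (x - avg);
    }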
It is not a problem.
First, note that 0 ≤ a < 1, so errors in the average tend to diminish, not accumulate. Incoming new data displaces old errors.
Subtracting floating-point numbers of similar magnitude (and the same sign) does not lose absolute accuracy. (You wrote "precision", but precision is the fineness with which values are represented, e.g., the width of the double type, and that does not change with subtraction.) Subtracting numbers of similar magnitude may cause an increase in relative error: since the result is smaller, the error is larger relative to it. However, the relative error of an intermediate value is of no concern.
In fact, subtracting two numbers, each of which equals or exceeds half the other, has no error: The correct mathematical result is exactly representable (Sterbenz’ Lemma).
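To illustrate (a contrived example of my own, assuming IEEE 754 double arithmetic): for two values each at least half the other, the subtraction round-trips exactly:

    #include <assert.h>

    int main(void)
    {
        double x = 0.7, y = 0.8;  /* each equals or exceeds half the other */
        /* By Sterbenz' Lemma, x - y is exactly representable, so adding
           y back recovers x with no rounding error at all. */
        assert((x - y) + y == x);
        return 0;
    }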
So the subtraction in the latter operation sequence is likely to be exact or low-error, depending on how much the values fluctuate. Then the multiplication and the addition have the usual rounding errors, and they are not particularly worrisome unless there are both positive and negative values, which can lead to large relative errors when the average is near zero. If a fused multiply-add operation is available (see fma in <tgmath.h>), then you can eliminate the error from the multiplication.
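For example, a sketch of the latter update using the double-precision fma from <math.h> (the function name is mine; <tgmath.h> provides the type-generic version):

    #include <math.h>

    /* avg + a * (x - avg), with the multiply and add performed as one
       operation: fma(a, x - avg, avg) rounds only once, so the separate
       rounding error of the multiplication is eliminated. */
    double ema_update_fma(double avg, double x, double a)
    {
        return fma(a, x - avg, avg);
    }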
In the former operation sequence, the evaluation of 1-a will be exact if a is at least ½. That leaves two multiplications and one addition. This will tend to have very slightly greater error than the latter sequence, but likely not enough to notice. As before, old errors will tend to diminish.