How to deal with excess precision in floating-point computations?

In my numerical simulation I have code similar to the following snippet

double x;
do {
  x = /* some computation */;
} while (x <= 0.0);
/* some algorithm that requires x to be (precisely) larger than 0 */

With certain compilers (e.g. gcc) on certain platforms (e.g. Linux, x87 math) it is possible that x is computed in higher than double precision ("with excess precision"). (Update: When I talk of precision here, I mean precision /and/ range.) Under these circumstances it is conceivable that the comparison (x <= 0.0) returns false even though x becomes 0 the next time it is rounded down to double precision. (And there's no guarantee that x isn't rounded down at an arbitrary point in time.)

Is there any way to perform this comparison that

  • is portable,
  • works in code that gets inlined,
  • has no performance impact and
  • doesn't exclude some arbitrary range (0, eps)?

I tried to use (x < std::numeric_limits<double>::denorm_min()) but that seemed to significantly slow down the loop when working with SSE2 math. (I know that denormals can slow down a computation, but I didn't expect them to be slower to just move around and compare.)

Update: An alternative is to use volatile to force x into memory before the comparison, e.g. by writing

} while (*((volatile double*)&x) <= 0.0);

However, depending on the application and the optimizations applied by the compiler, this solution can introduce a noticeable overhead too.

Update: The problem with any tolerance is that it's quite arbitrary, i.e. it depends on the specific application or context. I'd prefer to just do the comparison without excess precision, so that I don't have to make any additional assumptions or introduce some arbitrary epsilons into the documentation of my library functions.

Asked Feb 02 '09 by Stephan


2 Answers

In your question, you stated that using volatile will work but that there'll be a huge performance hit. What about using the volatile variable only during the comparison, allowing x to be held in a register?

double x; /* might have excess precision */
volatile double x_dbl; /* guaranteed to be double precision */
do {
  x = /* some computation */;
  x_dbl = x;
} while (x_dbl <= 0.0);

You should also check whether you can speed up the comparison against the smallest subnormal value by casting it to long double explicitly and caching that value, i.e.

const long double dbl_denorm_min = static_cast<long double>(std::numeric_limits<double>::denorm_min());

and then compare

x < dbl_denorm_min

I'd assume that a decent compiler would do this automatically, but one never knows...

Answered Nov 15 '22 by Christoph


I wonder whether you have the right stopping criterion. It sounds like x <= 0 is an exception condition rather than a terminating condition, and that the terminating condition is easier to satisfy. Maybe there should be a break statement inside your while loop that stops the iteration when some tolerance is met. For example, many algorithms terminate when two successive iterations are sufficiently close to each other.

Answered Nov 15 '22 by John D. Cook