Finding the maximum of a floating point counter

Question

My apologies if this has been asked before, but I cannot find it.

I was wondering if there is a way to calculate the point at which a single precision floating point number that is used as a counter will reach a 'maximum' (the point at which it is no longer able to add another value due to loss of precision).

For example, if I continuously add 0.1f to a float I will eventually reach a point where the value does not change:

const float INCREMENT = 0.1f;
float value = INCREMENT;
float prevVal = 0.0f;

do {
  prevVal = value;
  value += INCREMENT;
} while (value != prevVal);

cout << value << endl;

On GCC this outputs 2.09715e+06

Is there a way to compute this mathematically for different values of INCREMENT? I believe it should in theory be when the exponent portion of the float requires a shift beyond 23 bits, resulting in losing the mantissa and simply adding 0.

Eric Postpischil · Accepted Answer

Given some positive y used as an increment, the smallest X for which adding y does not produce a result greater than X is the least power of 2 not less than y divided by half the “epsilon” of the floating-point format. It can be calculated by:

Float Y = y*2/std::numeric_limits<Float>::epsilon();
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
if (X < Y) X *= 2;

A proof follows. I assume IEEE-754 binary floating-point arithmetic using round-to-nearest-ties-to-even.

When two numbers are added in IEEE-754 floating-point arithmetic, the result is the exact mathematical result rounded to the nearest representable value in a selected direction.

A note about notation: Text in source code format represents floating-point values and operations. Other text is mathematical. So x+y is the exact mathematical sum of x and y, x is x in floating-point format, and x+y is the result of adding x and y in a floating-point operation. Also, I will use Float for the floating-point type in C++.

Given a floating-point number x, consider adding a positive value y using floating-point arithmetic, x+y. Under what conditions will the result exceed x?

Let x₁ be the next value greater than x representable in the floating-point format, and let x_m be the midpoint between x and x₁. If the mathematical value of x+y is less than x_m, then the floating-point calculation x+y rounds down, so it produces x. If x+y is greater than x_m, either it rounds up and produces x₁, or it produces some greater number because y is large enough to move the sum beyond x₁. If x+y equals x_m, the result is whichever of x or x₁ has an even low digit. For reasons we will see, this is always x in the situations relevant to this question, so the calculation rounds down.

Therefore, x+y produces a result greater than x if and only if x+y exceeds x_m, meaning that y exceeds half the distance from x to x₁. Note that the distance from x to x₁ is the value of 1 in the low digit of the significand of x.

In a binary floating-point format with p digits in its significand, the position value of the low digit is 2^1−p times the position value of the high digit. For example, if x is 2^e, the highest bit in its significand represents 2^e, and the lowest bit represents 2^e+1−p.

The question asks, given a y, what is the least x for which x+y does not produce a result greater than x? It is the least x for which y does not exceed half the value of the low digit of the significand of x.

Let 2^e be the position value of the high bit of the significand of x. Then y ≤ ½•2^e+1−p = 2^e−p, so y•2^p ≤ 2^e.

Therefore, given some positive y, the least x for which x+y does not produce a result greater than x has its leading bit, 2^e, equal to or exceeding y•2^p. And in fact it must be exactly 2^e because all other floating-point numbers whose leading bit has position value 2^e have other bits set in their significands, so they are greater. 2^e is the least number for which the leading bit represents 2^e.

Therefore, x is the least power of two that equals or exceeds y•2^p.

In C++, std::numeric_limits<Float>::epsilon() (from the <limits> header) is the step from 1 to the next representable value, meaning it is 2^1−p. So y•2^p equals y*2/std::numeric_limits<Float>::epsilon(). (This operation is exact unless it overflows to ∞.)

Let’s assign this to a variable:

Float Y = y*2/std::numeric_limits<Float>::epsilon();

We can find the position value represented by the highest bit of Y’s significand by using frexp (from the <cmath> header) to extract the exponent from the floating-point representation of Y and ldexp (also <cmath>) to apply that exponent to a new significand (.5 because of the scale that frexp and ldexp use):

int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);

Then X is a power of two, and it is less than or equal to Y. It is in fact the greatest power of two not greater than Y, since the next greater power of 2, 2X, is greater than Y. However, we want the least power of two not less than Y. We can find this with:

if (X < Y) X *= 2;

The resulting X is the number sought by the question.

Finding the maximum of a floating point counter

Tags:

c++

floating-point

floating-accuracy

CplClegg

1 Answers

Eric Postpischil

Recent Activity

Donate For Us

Finding the maximum of a floating point counter

Tags:

c++

floating-point

floating-accuracy

CplClegg

1 Answers

Eric Postpischil

Related questions

Recent Activity

Donate For Us