Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to properly compare an integer and a floating-point value?

How do I compare an integer and a floating-point value the right way™?

The builtin comparsion operators give incorrect results in some edge cases, for example:

#include <iomanip>
#include <iostream>

int main()
{
    long long a = 999999984306749439;
    float     b = 999999984306749440.f; // This number can be represented exactly by a `float`.

    std::cout << std::setprecision(1000);
    std::cout << a << " < " << b << " = " << (a < b) << '\n';
    // Prints `999999984306749439 < 999999984306749440 = 0`, but it should be `1`.
}

Apparently, the comparsion operators convert both operands to a same type before actually comparing them. Here lhs gets converted to float, which causes a loss of precision, and leads to an incorrect result.

Even though I understand what's going on, I'm not sure how to work around this issue.


Disclaimer: The example uses a float and a long long, but I'm looking for a generic solution that works for every combination of an integral type and a floating-point type.

like image 481
HolyBlackCat Avatar asked Nov 06 '19 15:11

HolyBlackCat


People also ask

Can you compare floats and integers?

The integer is a data type used to define a number that contains all positive, negative or zero non-fractional values. These cannot have decimals. Float is a data type used to define a number that has a fractional value. These can have decimals also.

What is a better way to compare floating point values?

To compare two floating point or double values, we have to consider the precision in to the comparison. For example, if two numbers are 3.1428 and 3.1415, then they are same up to the precision 0.01, but after that, like 0.001 they are not same.

How is a floating point value different from an integer value?

Floating point numbers are different from integer numbers in that they contain fractional parts. Even if the number to the right of the decimal point is 0 (or decimal comma, if your locale uses commas instead of periods), it's still a fractional part of the number. Floating point numbers can be positive or negative.


4 Answers

(Restricting this answer to positive numbers; generalisation is trivial.)

  1. Get the number of bits in your exponent for the float on your platform along with the radix. If you have an IEEE754 32 bit float then this is a trivial step.

  2. Use (1) to compute the largest non-integer value that can be stored in your float. std::numeric_limits doesn't specify this value, annoyingly, so you need to do this yourself. For 32 bit IEEE754 you could take the easy option: 8388607.5 is the largest non-integral type float.

  3. If your float is less than or equal to (2), then check if it's an integer or not. If it's not an integer then you can round it appropriately so as not to invalidate the <.

  4. At this point, the float is an integer. Check if it's within in the range of your long long. If it's out of range then the result of < is known.

  5. If you get this far, then you can safely cast your float to a long long, and make the comparison.

like image 140
Bathsheba Avatar answered Oct 16 '22 08:10

Bathsheba


Here's what I ended up with.

Credit for the algorithm goes to @chux; his approach appears to outperform the other suggestions. You can find some alternative implementations in the edit history.

If you can think of any improvements, suggestions are welcome.

#include <compare>
#include <cmath>
#include <limits>
#include <type_traits>

template <typename I, typename F>
std::partial_ordering compare_int_float(I i, F f)
{
    if constexpr (std::is_integral_v<F> && std::is_floating_point_v<I>)
    {
        return 0 <=> compare_int_float(f, i);
    }
    else
    {
        static_assert(std::is_integral_v<I> && std::is_floating_point_v<F>);
        static_assert(std::numeric_limits<F>::radix == 2);
        
        // This should be exactly representable as F due to being a power of two.
        constexpr F I_min_as_F = std::numeric_limits<I>::min();
        
        // The `numeric_limits<I>::max()` itself might not be representable as F, so we use this instead.
        constexpr F I_max_as_F_plus_1 = F(std::numeric_limits<I>::max()/2+1) * 2;

        // Check if the constants above overflowed to infinity. Normally this shouldn't happen.
        constexpr bool limits_overflow = I_min_as_F * 2 == I_min_as_F || I_max_as_F_plus_1 * 2 == I_max_as_F_plus_1;
        if constexpr (limits_overflow)
        {
            // Manually check for special floating-point values.
            if (std::isinf(f))
                return f > 0 ? std::partial_ordering::less : std::partial_ordering::greater;
            if (std::isnan(f))
                return std::partial_ordering::unordered;
        }

        if (limits_overflow || f >= I_min_as_F)
        {
            // `f <= I_max_as_F_plus_1 - 1` would be problematic due to rounding, so we use this instead.
            if (limits_overflow || f - I_max_as_F_plus_1 <= -1)
            {
                I f_trunc = f;
                if (f_trunc < i)
                    return std::partial_ordering::greater;
                if (f_trunc > i)
                    return std::partial_ordering::less;
            
                F f_frac = f - f_trunc;
                if (f_frac < 0)
                    return std::partial_ordering::greater;
                if (f_frac > 0)
                    return std::partial_ordering::less;
                    
                return std::partial_ordering::equivalent;
            }

            return std::partial_ordering::less;
        }
        
        if (f < 0)
            return std::partial_ordering::greater;
        
        return std::partial_ordering::unordered;
    }
}

If you want to experiment with it, here are a few test cases:

#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream> 

void compare_print(long long a, float b, int n = 0)
{
    if (n == 0)
    {
        auto result = compare_int_float(a,b);
        static constexpr std::partial_ordering values[] = {std::partial_ordering::less, std::partial_ordering::equivalent, std::partial_ordering::greater, std::partial_ordering::unordered};
        std::cout << a << ' ' << "<=>?"[std::find(values, values+4, result) - values] << ' ' << b << '\n';
    }
    else
    {
        for (int i = 0; i < n; i++)
            b = std::nextafter(b, -INFINITY);
            
        for (int i = 0; i <= n*2; i++)
        {
            compare_print(a, b);
            b = std::nextafter(b, INFINITY);
        }
        
        std::cout << '\n';
    }
}

int main()
{    
    std::cout << std::setprecision(1000);
    
    compare_print(999999984306749440,
                  999999984306749440.f, 2);
                  
    compare_print(999999984306749439,
                  999999984306749440.f, 2);
                  
    compare_print(100,
                  100.f, 2);
    
    compare_print(-100,
                  -100.f, 2);
                  
    compare_print(0,
                  0.f, 2);
                                    
    compare_print((long long)0x8000'0000'0000'0000,
                  (long long)0x8000'0000'0000'0000, 2);
                                    
    compare_print(42, INFINITY);
    compare_print(42, -INFINITY);
    compare_print(42, NAN);
    std::cout << '\n';

    compare_print(1388608,
                  1388608.f, 2);
    
    compare_print(12388608,
                  12388608.f, 2);
}

(run the code)

like image 36
HolyBlackCat Avatar answered Oct 16 '22 06:10

HolyBlackCat


To compare a FP f and integer i for equality:

(Code is representative and uses comparison of float and long long as an example)

  1. If f is a NaN, infinity, or has a fractional part (perhaps use frexp()), f is not equal to i.

    float ipart;
    // C++
    if (frexp(f, &ipart) != 0) return not_equal;
    // C
    if (frexpf(f, &ipart) != 0) return not_equal;
    
  2. Convert the numeric limits of i into exactly representable FP values (powers of 2) near those limits.** Easy to do if we assume FP is not a rare base 10 encoding and range of double exceeds the range on the i. Take advantage that integer limits magnitudes are or near Mersenne Number. (Sorry example code is C-ish)

    #define FP_INT_MAX_PLUS1 ((LLONG_MAX/2 + 1)*2.0)
    #define FP_INT_MIN (LLONG_MIN*1.0)
    
  3. Compare f to is limits

    if (f >= FP_INT_MAX_PLUS1) return not_equal;
    if (f < FP_INT_MIN) return not_equal;
    
  4. Convert f to integer and compare

    return (long long) f == i;
    

To compare a FP f and integer i for <, >, == or not comparable:

(Using above limits)

  1. Test f >= lower limit

    if (f >= FP_INT_MIN) {
    
  2. Test f <= upper limit

      // reform below to cope with effects of rounding
      // if (f <= FP_INT_MAX_PLUS1 - 1)
      if (f - FP_INT_MAX_PLUS1 <= -1.0) {
    
  3. Convert f to integer/fraction and compare

        // at this point `f` is in the range of `i`
        long long ipart = (long long) f;
        if (ipart < i) return f_less_than_i;
        if (ipart > i) return f_more_than_i;
    
        float frac = f - ipart;
        if (frac < 0) return f_less_than_i;
        if (frac > 0) return f_more_than_i;
        return equal;
      }
    
  4. Handle edge cases

      else return f_more_than_i;
    }
    if (f < 0.0) return f_less_than_i;
    return not_comparable;
    

Simplifications possible, yet I wanted to convey the algorithm.


** Additional conditional code needed to cope with non 2's complement integer encoding. It is quite similar to the MAX code.

like image 3
chux - Reinstate Monica Avatar answered Oct 16 '22 08:10

chux - Reinstate Monica


The code below works with integer data types of at most 64 bits and floating point data types of at most ieee-754 double precision accuracy. For wider data types the same idea can be used, but you'll have to adapt he code. Since I'm not very familiar with C++, the code is written in C. It shouldn't be too difficult to convert it to a C++ style code. The code is branchless, which might be a performance benefit.


#include <stdio.h>
// gcc -O3 -march=haswell cmp.c
// Assume long long int is 64 bits.
// Assume ieee-754 double precision.
int long_long_less_than_double(long long int i, double y) {
    long long i_lo = i & 0x00000000FFFFFFFF;   // Extract lower 32 bits.
    long long i_hi = i & 0xFFFFFFFF00000000;   // Extract upper 32 bits.
    double x_lo = (double)i_lo;                // Exact conversion to double, no rounding errors!
    double x_hi = (double)i_hi;                // 
    return ( x_lo < (y - x_hi) );              // If i is close to y then y - x_hi is exact,
                                               // due to Sterbenz' lemma.
    // i < y
    // i_lo +i_hi < y      
    // i_lo < (y - i_hi)
    // x_lo < (y - x_hi)
}

int long_long_equals_double(long long int i, double y) {
    long long i_lo = i & 0x00000000FFFFFFFF;   
    long long i_hi = i & 0xFFFFFFFF00000000;   
    double x_lo = (double)i_lo;                    
    double x_hi = (double)i_hi;                    
    return ( x_lo == (y - x_hi) );                  
}                                                  


int main()
{
    long long a0 = 999999984306749439;
    long long a1 = 999999984306749440;    // Hex number: 0x0DE0B6B000000000
    long long a2 = 999999984306749441;
    float     b = 999999984306749440.f;   // This number can be represented exactly by a `float`.

    printf("%lli less_than %20.1f = %i\n", a0, b, long_long_less_than_double(a0, b));  // Implicit conversion from float to double
    printf("%lli less_than %20.1f = %i\n", a1, b, long_long_less_than_double(a1, b));

    printf("%lli equals    %20.1f = %i\n", a0, b, long_long_equals_double(a0, b));
    printf("%lli equals    %20.1f = %i\n", a1, b, long_long_equals_double(a1, b));
    printf("%lli equals    %20.1f = %i\n\n", a2, b, long_long_equals_double(a2, b));


    long long c0 = 1311693406324658687;
    long long c1 = 1311693406324658688;   // Hex number: 0x1234123412341200
    long long c2 = 1311693406324658689; 
    double     d = 1311693406324658688.0; // This number can be represented exactly by a `double`.

    printf("%lli less_than %20.1f = %i\n", c0, d, long_long_less_than_double(c0, d));
    printf("%lli less_than %20.1f = %i\n", c1, d, long_long_less_than_double(c1, d));

    printf("%lli equals    %20.1f = %i\n", c0, d, long_long_equals_double(c0, d));
    printf("%lli equals    %20.1f = %i\n", c1, d, long_long_equals_double(c1, d));
    printf("%lli equals    %20.1f = %i\n", c2, d, long_long_equals_double(c2, d));


    return 0;
}

The idea is to split the 64 bits integer i in 32 upper bits i_hi and 32 lower bits i_lo, which are converted to doubles x_hi and x_lo without any rounding errors. If double y is close to x_hi, then the floating point subtraction y - x_hi is exact, due to Sterbenz' lemma. So, instead of x_lo + x_hi < y, we can test for x_lo < (y - x_hi), which is more accurate! If double y is not close to x_hi then y - x_hi is inacurate, but in that case we don't need the accuracy because then |y - x_hi| is much larger than |x_lo|. In other words: If i and y differ much than we don't have to worry about the value of the lower 32 bits.

Output:

    999999984306749439 less_than 999999984306749440.0 = 1
    999999984306749440 less_than 999999984306749440.0 = 0
    999999984306749439 equals    999999984306749440.0 = 0
    999999984306749440 equals    999999984306749440.0 = 1
    999999984306749441 equals    999999984306749440.0 = 0

    1311693406324658687 less_than 1311693406324658688.0 = 1
    1311693406324658688 less_than 1311693406324658688.0 = 0
    1311693406324658687 equals    1311693406324658688.0 = 0
    1311693406324658688 equals    1311693406324658688.0 = 1
    1311693406324658689 equals    1311693406324658688.0 = 0
like image 2
wim Avatar answered Oct 16 '22 06:10

wim