For some integer type, how can I find the value that is closest to some value of a floating-point type, even when the floating-point value is far outside the representable range of the integer type?
Or more precisely:
Let F be a floating-point type (probably float, double, or long double).
Let I be an integer type.
Assume that both F and I have valid specializations of std::numeric_limits<>.
Given a representable value of F, and using only C++03, how can I find the closest representable value of I?
I am after a pure, efficient, and thread-safe solution, and one that assumes nothing about the platform except what is guaranteed by C++03.
If such a solution does not exist, is it possible to find one using the new features of C99/C++11?
Using lround() of C99 seems to be problematic due to the non-trivial way in which domain errors are reported. Can these domain errors be caught in a portable and thread-safe way?
Note: I am aware that Boost probably offers a solution via its boost::numeric::converter<> template, but due to its high complexity and verbosity I have not been able to extract the essentials from it, and therefore I have not been able to check whether that solution makes assumptions beyond C++03.
The following naive approach fails because the result of I(f) is undefined by C++03 when the integral part of f is not a representable value of I.
template<class I, class F> I closest_int(F f)
{
    return I(f);
}
Consider then the following approach:
template<class I, class F> I closest_int(F f)
{
    if (f < std::numeric_limits<I>::min()) return std::numeric_limits<I>::min();
    if (std::numeric_limits<I>::max() < f) return std::numeric_limits<I>::max();
    return I(f);
}
This also fails because the integral parts of F(std::numeric_limits<I>::min()) and F(std::numeric_limits<I>::max()) may still not be representable in I.
Finally, consider this third approach, which also fails:
template<class I, class F> I closest_int(F f)
{
    if (f <= std::numeric_limits<I>::min()) return std::numeric_limits<I>::min();
    if (std::numeric_limits<I>::max() <= f) return std::numeric_limits<I>::max();
    return I(f);
}
This time I(f) will always have a well-defined result. However, since F(std::numeric_limits<I>::max()) may be much smaller than std::numeric_limits<I>::max(), it is possible that we will return std::numeric_limits<I>::max() for a floating-point value that is many integer values below std::numeric_limits<I>::max().
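To put a number on that gap, here is a small worked example for I = int and F = float. It is only a sketch under explicit assumptions: a 32-bit int and an IEEE 754 single-precision float with a 24-bit significand. The helper names largestFloatBelowTwoPow31 and gapBelowIntMax are made up for illustration. If the implementation rounds F(std::numeric_limits<int>::max()) down, the result is the float computed below, and the third approach would then return std::numeric_limits<int>::max() for it even though that float is itself exactly representable as an int.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Largest float strictly below 2^31, ASSUMING a radix-2 float with a 24-bit
// significand: floats in [2^30, 2^31) are spaced 2^(30-23) = 128 apart.
inline float largestFloatBelowTwoPow31()
{
    assert(std::numeric_limits<float>::radix == 2);
    assert(std::numeric_limits<float>::digits == 24);
    return std::ldexp(1.0f, 31) - std::ldexp(1.0f, 7); // 2^31 - 128
}

// Distance between that float and the largest int, ASSUMING a 32-bit int.
inline int gapBelowIntMax()
{
    assert(std::numeric_limits<int>::digits == 31);
    // The conversion is well defined: the value has no fractional part
    // and lies below std::numeric_limits<int>::max().
    return std::numeric_limits<int>::max() - int(largestFloatBelowTwoPow31());
}
```

Under these assumptions the float is 2147483520 and the gap is 127, so the third approach can be off by up to 127 for this pair of types.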
Note that all the trouble arises because it is implementation-defined whether the conversion F(i) rounds up or down to the nearest representable floating-point value.
Here is the relevant section from C++03 (4.9 Floating-integral conversions):
An rvalue of an integer type or of an enumeration type can be converted to an rvalue of a floating point type. The result is exact if possible. Otherwise, it is an implementation-defined choice of either the next lower or higher representable value.
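Whether the "exact if possible" case applies to the extreme values of I can be checked from std::numeric_limits alone. The following is a rough sketch, assuming F is a radix-2 type; the function name maxConvertsExactly is hypothetical. Since std::numeric_limits<I>::max() needs all of I's value bits, it converts exactly only if F has at least that many significand digits and enough exponent range.

```cpp
#include <limits>

// Hypothetical helper: true when std::numeric_limits<I>::max() (and every
// other value of I) converts to F without rounding. ASSUMES radix-2 F.
template<class I, class F> bool maxConvertsExactly()
{
    typedef std::numeric_limits<I> IL;
    typedef std::numeric_limits<F> FL;
    return FL::radix == 2
        && IL::digits <= FL::digits       // enough significand bits
        && IL::digits <  FL::max_exponent; // 2^digits is within range
}
```

On a typical platform with 32-bit int and IEEE 754 types, this gives true for short/float and int/double but false for int/float, which is the combination where the comparisons against F(std::numeric_limits<I>::max()) become unreliable.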
I have a practical solution for radix-2 (binary) floating-point types and integer types up to and longer than 64-bit. See below. The comments should be clear. Output follows.
// file: f2i.cpp
//
// compiled with MinGW x86 (gcc version 4.6.2) as:
// g++ -Wall -O2 -std=c++03 f2i.cpp -o f2i.exe
#include <iostream>
#include <iomanip>
#include <limits>
using namespace std;
template<class I, class F> I truncAndCap(F f)
{
    /*
      This function converts (by truncating the
      fractional part) the floating-point value f (of type F)
      into an integer value (of type I), avoiding undefined
      behavior by returning std::numeric_limits<I>::min() and
      std::numeric_limits<I>::max() when f is too small or
      too big to be converted to type I directly.

      Problems:
      - F may fail to convert to I,
        which is undefined behavior and we want to avoid that.
      - I may not convert exactly into F
      - Direct I & F comparison fails because of I to F promotion,
        which can be inexact.

      This solution is for the most practical case when I and F
      are radix-2 (binary) integer and floating-point types.
    */
    int Idigits = numeric_limits<I>::digits;
    int Isigned = numeric_limits<I>::is_signed;

    /*
      Calculate cutOffMax = 2 ^ std::numeric_limits<I>::digits
      (where ^ denotes exponentiation) as a value of type F.

      We assume that F is a radix-2 (binary) floating-point type AND
      it has a big enough exponent part to hold the value of
      std::numeric_limits<I>::digits.

      FLT_MAX_10_EXP/DBL_MAX_10_EXP/LDBL_MAX_10_EXP >= 37
      (guaranteed per C++ standard from 2003/C standard from 1999)
      corresponds to log2(1e37) ~= 122, so the type I can contain
      up to 122 bits. In practice, integers longer than 64 bits
      are extremely rare (if existent at all), especially on old systems
      of the 2003 C++ standard's time.

      The power of two is built from two half-width shifts so that
      neither shift reaches the full width of I.
    */
    const F cutOffMax = F(I(1) << Idigits / 2) * F(I(1) << (Idigits / 2 + Idigits % 2));

    if (f >= cutOffMax)
        return numeric_limits<I>::max();

    /*
      Calculate cutOffMin = - 2 ^ std::numeric_limits<I>::digits
      (where ^ denotes exponentiation) as a value of type F for
      signed I's OR cutOffMin = 0 for unsigned I's in a similar fashion.
    */
    const F cutOffMin = Isigned ? -F(I(1) << Idigits / 2) * F(I(1) << (Idigits / 2 + Idigits % 2)) : 0;

    if (f <= cutOffMin)
        return numeric_limits<I>::min();

    /*
      Mathematically, we may still have a little problem (2 cases):
        cutOffMin < f < std::numeric_limits<I>::min()
        std::numeric_limits<I>::max() < f < cutOffMax

      These cases are only possible when f isn't a whole number, when
      it's either std::numeric_limits<I>::min() - value in the range (0,1)
      or std::numeric_limits<I>::max() + value in the range (0,1).

      We can ignore this altogether because converting f to type I is
      guaranteed to truncate the fractional part off, and therefore
      I(f) will always be in the range
      [std::numeric_limits<I>::min(), std::numeric_limits<I>::max()].

      Note: a NaN compares false in both tests above and therefore
      reaches this conversion, which is undefined for NaN.
    */
    return I(f);
}
template<class I, class F> void test(const char* msg, F f)
{
    I i = truncAndCap<I,F>(f);
    cout <<
        msg <<
        setiosflags(ios_base::showpos) <<
        setw(14) << setprecision(12) <<
        f << " -> " <<
        i <<
        resetiosflags(ios_base::showpos) <<
        endl;
}

// No trailing semicolon here: each call site supplies its own.
#define TEST(I,F,VAL) \
    test<I,F>(#F " -> " #I ": ", VAL)
int main()
{
    TEST(short, float, -1.75f);
    TEST(short, float, -1.25f);
    TEST(short, float, +0.00f);
    TEST(short, float, +1.25f);
    TEST(short, float, +1.75f);

    TEST(short, float, -32769.00f);
    TEST(short, float, -32768.50f);
    TEST(short, float, -32768.00f);
    TEST(short, float, -32767.75f);
    TEST(short, float, -32767.25f);
    TEST(short, float, -32767.00f);
    TEST(short, float, -32766.00f);
    TEST(short, float, +32766.00f);
    TEST(short, float, +32767.00f);
    TEST(short, float, +32767.25f);
    TEST(short, float, +32767.75f);
    TEST(short, float, +32768.00f);
    TEST(short, float, +32768.50f);
    TEST(short, float, +32769.00f);

    TEST(int, float, -2147483904.00f);
    TEST(int, float, -2147483648.00f);
    TEST(int, float, -16777218.00f);
    TEST(int, float, -16777216.00f);
    TEST(int, float, -16777215.00f);
    TEST(int, float, +16777215.00f);
    TEST(int, float, +16777216.00f);
    TEST(int, float, +16777218.00f);
    TEST(int, float, +2147483648.00f);
    TEST(int, float, +2147483904.00f);

    TEST(int, double, -2147483649.00);
    TEST(int, double, -2147483648.00);
    TEST(int, double, -2147483647.75);
    TEST(int, double, -2147483647.25);
    TEST(int, double, -2147483647.00);
    TEST(int, double, +2147483647.00);
    TEST(int, double, +2147483647.25);
    TEST(int, double, +2147483647.75);
    TEST(int, double, +2147483648.00);
    TEST(int, double, +2147483649.00);

    TEST(unsigned, double, -1.00);
    TEST(unsigned, double, +1.00);
    TEST(unsigned, double, +4294967295.00);
    TEST(unsigned, double, +4294967295.25);
    TEST(unsigned, double, +4294967295.75);
    TEST(unsigned, double, +4294967296.00);
    TEST(unsigned, double, +4294967297.00);

    return 0;
}
Output (ideone prints the same as my PC):
float -> short: -1.75 -> -1
float -> short: -1.25 -> -1
float -> short: +0 -> +0
float -> short: +1.25 -> +1
float -> short: +1.75 -> +1
float -> short: -32769 -> -32768
float -> short: -32768.5 -> -32768
float -> short: -32768 -> -32768
float -> short: -32767.75 -> -32767
float -> short: -32767.25 -> -32767
float -> short: -32767 -> -32767
float -> short: -32766 -> -32766
float -> short: +32766 -> +32766
float -> short: +32767 -> +32767
float -> short: +32767.25 -> +32767
float -> short: +32767.75 -> +32767
float -> short: +32768 -> +32767
float -> short: +32768.5 -> +32767
float -> short: +32769 -> +32767
float -> int: -2147483904 -> -2147483648
float -> int: -2147483648 -> -2147483648
float -> int: -16777218 -> -16777218
float -> int: -16777216 -> -16777216
float -> int: -16777215 -> -16777215
float -> int: +16777215 -> +16777215
float -> int: +16777216 -> +16777216
float -> int: +16777218 -> +16777218
float -> int: +2147483648 -> +2147483647
float -> int: +2147483904 -> +2147483647
double -> int: -2147483649 -> -2147483648
double -> int: -2147483648 -> -2147483648
double -> int: -2147483647.75 -> -2147483647
double -> int: -2147483647.25 -> -2147483647
double -> int: -2147483647 -> -2147483647
double -> int: +2147483647 -> +2147483647
double -> int: +2147483647.25 -> +2147483647
double -> int: +2147483647.75 -> +2147483647
double -> int: +2147483648 -> +2147483647
double -> int: +2147483649 -> +2147483647
double -> unsigned: -1 -> 0
double -> unsigned: +1 -> 1
double -> unsigned: +4294967295 -> 4294967295
double -> unsigned: +4294967295.25 -> 4294967295
double -> unsigned: +4294967295.75 -> 4294967295
double -> unsigned: +4294967296 -> 4294967295
double -> unsigned: +4294967297 -> 4294967295
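The question asks for the closest integer, whereas the function above truncates toward zero. One possible way to get round-to-nearest behavior on top of the same clamping idea is sketched below. It is not part of the tested program above: the name roundAndCap is hypothetical, ties are rounded away from zero, NaN is mapped to 0 as an arbitrary policy, and the cut-off is built with std::ldexp from <cmath> instead of shifts. Like the original, it assumes F is a radix-2 type whose exponent range covers 2 ^ std::numeric_limits<I>::digits.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical round-to-nearest variant (ties away from zero), C++03 only.
template<class I, class F> I roundAndCap(F f)
{
    typedef std::numeric_limits<I> IL;
    assert(std::numeric_limits<F>::radix == 2); // stated assumption

    if (f != f)       // true only for NaN
        return I(0);  // arbitrary policy chosen for this sketch

    // 2 ^ digits is a power of two, hence exact in a radix-2 F.
    const F cutOffMax = std::ldexp(F(1), IL::digits);
    if (f >= cutOffMax)
        return IL::max();
    if (IL::is_signed ? f <= -cutOffMax : f <= F(0))
        return IL::min();

    // Safe now: the truncated value of f lies within [min(), max()].
    I t = I(f);

    // t is the integral part of f, so F(t) and f - F(t) are exact.
    const F frac = f - F(t);
    if (frac >= F(0.5) && t < IL::max())
        ++t;
    else if (frac <= F(-0.5) && t > IL::min())
        --t;
    return t;
}
```

With this sketch, roundAndCap<short>(32766.5f) gives 32767, roundAndCap<short>(-1.5f) gives -2, and out-of-range inputs such as 1e10f still saturate at the limits of short.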