I have a 32-bit floating-point number f (known to be positive) that I need to convert to a 32-bit unsigned integer. Its magnitude might be too large to fit. Furthermore, there is downstream computation that requires some headroom. I can compute the maximum acceptable value m as a 32-bit integer. How do I efficiently determine, in C++11 on a constrained 32-bit machine (ARM M4F), whether f <= m holds mathematically? Note that the types of the two values don't match. The following three approaches each have their issues:
- static_cast<uint32_t>(f) <= m: I think this triggers undefined behaviour if f doesn't fit the 32-bit integer.
- f <= static_cast<float>(m): if m is too large to be converted exactly, the converted value could be larger than m, such that the subsequent comparison will produce the wrong result in certain edge cases.
- static_cast<double>(f) <= static_cast<double>(m): is mathematically correct, but requires casting to, and working with, double, which I'd like to avoid for efficiency reasons.
Surely there must be a way to convert an integer to a float directly with a specified rounding direction, i.e. guaranteeing the result not to exceed the input in magnitude. I'd prefer a C++11 standard solution, but in the worst case platform intrinsics could qualify as well.
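To make the second concern concrete, here is a minimal sketch comparing the float-based and double-based comparisons. The helper names are hypothetical and chosen only for illustration. Assuming the default round-to-nearest mode, m = 4294967295 converts to 4294967296.0f, so f = 4294967296.0f passes the float comparison even though it exceeds m.

```cpp
#include <cstdint>

// Second approach from the question: compare in float.
// static_cast<float>(m) may round up, so this can accept an f that is > m.
// (Hypothetical helper name, for illustration only.)
bool le_via_float(float f, std::uint32_t m)
{
    return f <= static_cast<float>(m);
}

// Third approach: compare in double. Every float and every 32-bit
// integer is exactly representable as a double, so this is exact.
bool le_via_double(float f, std::uint32_t m)
{
    return static_cast<double>(f) <= static_cast<double>(m);
}
```

With f = 4294967296.0f and m = 4294967295, le_via_float returns true while le_via_double returns false. The same happens for smaller values, for instance f = 16777220.0f and m = 16777219.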
I think your best bet is to be a bit platform specific. 2³² can be represented precisely in floating point. Check if f is too large to fit at all, and then convert to unsigned and check against m.
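    // 2^32 is exactly representable as a float.
    const float unsigned_limit = 4294967296.0f;
    bool ok = false;
    if (f < unsigned_limit)
    {
        // The conversion is well defined here because f fits;
        // it truncates toward zero.
        const auto uf = static_cast<unsigned int>(f);
        if (uf <= m)
        {
            ok = true;
        }
    }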
Not fond of needing two comparisons, but it's clear.
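As a sketch of how this check could be packaged, the function below wraps the same range test and conversion. The name float_le_u32 is hypothetical. It also adds one step that is not in the answer above: because the conversion truncates, f = 3.5 with m = 3 would pass a plain uf <= m test although 3.5 > 3. If only the converted integer matters downstream, the simpler test is enough; the extra check is for the case where the strict mathematical comparison is wanted.

```cpp
#include <cstdint>

// Hypothetical helper: true when f <= m holds mathematically.
// Assumes f >= 0, as stated in the question; NaN yields false.
bool float_le_u32(float f, std::uint32_t m)
{
    const float unsigned_limit = 4294967296.0f; // 2^32, exact in float
    if (!(f < unsigned_limit))
    {
        return false; // cannot fit in 32 bits (or f is NaN)
    }
    // Well defined because f < 2^32; truncates toward zero.
    const std::uint32_t uf = static_cast<std::uint32_t>(f);
    if (uf != m)
    {
        return uf < m;
    }
    // uf == m: f <= m only if truncation dropped no fractional part.
    // Converting uf back is exact: either f < 2^24, so uf is a small
    // integer, or f >= 2^23, so f was already an integer.
    return static_cast<float>(uf) == f;
}
```

This uses only single-precision and 32-bit integer operations, so it should avoid software double arithmetic on a target with a single-precision FPU.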
If f is usually significantly less than m (or usually significantly greater), one can test against float(m)*0.99f (respectively float(m)*1.01f), and then do the exact comparison in the unusual case. That is probably only worth doing if profiling shows that the performance gain is worth the extra complexity.
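The pre-filter idea could look roughly like the sketch below. The function names are hypothetical, and the exact fallback shown is only one possible choice, assuming f >= 0 as in the question. The 1% margins are far larger than the rounding error of converting m to float, so the two quick tests never give a wrong answer; they only decide the clear-cut cases.

```cpp
#include <cstdint>

// Exact fallback (hypothetical helper): range check, truncating
// conversion, then a check that no fractional part was dropped.
bool exact_le(float f, std::uint32_t m)
{
    if (!(f < 4294967296.0f)) // 2^32; also rejects NaN
    {
        return false;
    }
    const std::uint32_t uf = static_cast<std::uint32_t>(f);
    return (uf != m) ? (uf < m) : (static_cast<float>(uf) == f);
}

// Quick tests with a safety margin first, exact test only near m.
bool float_le_u32_fast(float f, std::uint32_t m)
{
    const float approx = static_cast<float>(m); // may be rounded
    if (f <= approx * 0.99f)
    {
        return true;  // clearly not above m
    }
    if (f > approx * 1.01f)
    {
        return false; // clearly above m
    }
    return exact_le(f, m); // unusual case: f is within about 1% of m
}
```

Whether the two extra multiplications and comparisons pay off depends on how often f lands near m, which is why profiling is the deciding factor.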