
Compare a 32 bit float and a 32 bit integer without casting to double, when either value could be too large to fit the other type exactly

I have a 32-bit floating point number f (known to be positive) that I need to convert to a 32-bit unsigned integer. Its magnitude might be too large to fit. Furthermore, there is downstream computation that requires some headroom. I can compute the maximum acceptable value m as a 32-bit integer. How do I efficiently determine, in C++11 on a constrained 32-bit machine (ARM Cortex-M4F), whether f <= m holds mathematically? Note that the types of the two values don't match. The following three approaches each have their issues:

  • static_cast<uint32_t>(f) <= m: I think this triggers undefined behaviour if f doesn't fit the 32 bit integer
  • f <= static_cast<float>(m): if m is too large to be converted exactly, the converted value could be larger than m such that the subsequent comparison will produce the wrong result in certain edge cases
  • static_cast<double>(f) <= static_cast<double>(m): is mathematically correct, but requires casting to, and working with double, which I'd like to avoid for efficiency reasons
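The edge case in the second bullet can be made concrete. This is a minimal sketch, assuming an IEEE-754 single precision float and an integer-to-float conversion that rounds to nearest (the C++ standard leaves the rounding direction implementation-defined). The function names are illustrative only:

```cpp
#include <cstdint>

// Second approach from the list: convert m to float, then compare.
// Gives the wrong answer whenever the conversion rounds m upwards.
bool fits_via_float(float f, std::uint32_t m) {
    return f <= static_cast<float>(m);
}

// Third approach: exact under the IEEE-754 assumption, because every
// float and every 32-bit integer is exactly representable as a double.
bool fits_via_double(float f, std::uint32_t m) {
    return static_cast<double>(f) <= static_cast<double>(m);
}
```

With m = 4294967295 (2³² − 1) and f = 4294967296.0f (2³²), static_cast<float>(m) rounds up to 2³², so fits_via_float(f, m) reports true although f > m, whereas fits_via_double(f, m) reports false.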

Surely there must be a way to convert an integer to a float directly with specified rounding direction, i.e. guaranteeing the result not to exceed the input in magnitude. I'd prefer a C++11 standard solution, but in the worst case platform intrinsics could qualify as well.

burnpanck asked May 09 '17 06:05




1 Answer

I think your best bet is to be a bit platform specific. 2³² can be represented exactly as a float. First check whether f is too large to fit at all; if it is not, convert it to unsigned and compare the result against m.

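const float unsigned_limit = 4294967296.0f;  // 2^32, exactly representable as a float
bool ok = false;
if (f < unsigned_limit)  // otherwise f >= 2^32 > m, and the cast below would be undefined
{
    // f is known to be positive and is below 2^32, so the truncated value fits.
    const auto uf = static_cast<unsigned int>(f);  // unsigned int is 32 bits on this platform
    if (uf <= m)  // the cast truncates, so this compares floor(f) with m
    {
        ok = true;
    }
}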

I'm not fond of needing two comparisons, but it's clear.
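The two-step check can be wrapped into a function. The following is only a sketch under stated assumptions: the name fits_exactly is hypothetical, float is taken to be IEEE-754 single precision, and f is non-negative as the question states. It also adds one step that the snippet above does not have: since the cast truncates, floor(f) <= m differs from f <= m when f has a fractional part and floor(f) equals m, so that case is settled by converting back to float.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true iff f <= m mathematically.
// Assumes f >= 0 and an IEEE-754 single precision float.
bool fits_exactly(float f, std::uint32_t m) {
    assert(f >= 0.0f);  // precondition from the question; also rejects NaN
    const float unsigned_limit = 4294967296.0f;  // 2^32, exact in float
    if (!(f < unsigned_limit)) {
        return false;  // f >= 2^32 > m (also catches infinity)
    }
    // Safe: 0 <= f < 2^32, so the truncated value fits in uint32_t.
    const auto uf = static_cast<std::uint32_t>(f);
    if (uf != m) {
        return uf < m;  // floor(f) != m already decides the comparison
    }
    // floor(f) == m: f <= m holds only if f has no fractional part.
    // The conversion back is exact here, so it equals f only in that case.
    return static_cast<float>(uf) == f;
}
```

For example, fits_exactly(1.5f, 1u) is false while fits_exactly(1.5f, 2u) is true. If only the truncated integer matters downstream, the final step can be dropped in favour of a plain uf <= m.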

If f is usually significantly less than m (or usually significantly greater), one can test against float(m)*0.99f (respectively float(m)*1.01f), and then do the exact comparison in the unusual case. That is probably only worth doing if profiling shows that the performance gain is worth the extra complexity.
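A rough sketch of that fast path, for illustration only. Both function names are hypothetical, f is assumed to be non-negative, and float is assumed to be IEEE-754 single precision. The 1% margins are far larger than the rounding error of float(m), so the two early exits cannot give a wrong answer; the exact fallback here compares f itself (not just its truncated value) with m.

```cpp
#include <cstdint>

// Exact fallback (hypothetical helper): true iff f <= m, for f >= 0.
bool exact_le(float f, std::uint32_t m) {
    if (!(f < 4294967296.0f)) return false;         // f >= 2^32 > m
    const auto uf = static_cast<std::uint32_t>(f);  // truncates, in range
    return uf < m || (uf == m && static_cast<float>(uf) == f);
}

// Cheap float-only tests first, exact comparison only for close calls.
bool fits_with_fast_path(float f, std::uint32_t m) {
    const float approx = static_cast<float>(m);  // may be off by rounding
    if (f < approx * 0.99f) return true;   // clearly below m
    if (f > approx * 1.01f) return false;  // clearly above m
    return exact_le(f, m);                 // within about 1% of m
}
```

Whether the two extra multiplications pay off depends on how often f lands within about 1% of m, which is something only profiling on the target can answer.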

Martin Bonner supports Monica answered Oct 05 '22 14:10