How to convert a uint64_t to a double/float between 0 and 1 with maximum accuracy (C++)?

I'm writing an image class based on unsigned integers. I'm using uint8_t and uint16_t buffers currently for 8-bit and 16-bit RGBA pixels, and to convert from 16-bit to 8-bit I simply have to take the 16 bit value, divide by std::numeric_limits< uint16_t >::max() converted to a double, then multiply that by 255.
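For reference, the 16-bit to 8-bit conversion described above might look like the following minimal sketch. The helper name `to8bit` is hypothetical, and the rounding to nearest (the added 0.5) is an assumption, since the question does not say how the result is rounded.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: scale a 16-bit channel value to 8 bits
// by way of a double in [0, 1].
std::uint8_t to8bit(std::uint16_t v) {
    const double unit =
        static_cast<double>(v) / std::numeric_limits<std::uint16_t>::max();
    // Assumption: round to nearest. Adding 0.5 before the truncating cast
    // does that for non-negative values.
    return static_cast<std::uint8_t>(unit * 255.0 + 0.5);
}
```

Every uint16_t value is exactly representable as a double, so the only rounding here happens in the division and the final cast.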

However, if I wanted to have an image with 64-bit unsigned integers for each RGBA component (I know, it's absurdly high), how would I go about finding a float/double between 0 and 1 that represents how far between 0 and the max uint64_t my pixel value is? I assume that converting to doubles wouldn't work because doubles are generally 64-bit floats, and you can't capture all 64-bit unsigned integer values in a 64-bit float. Dividing without converting to floats/doubles would just give me 0 or sometimes 1.
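The precision concern can be checked with a small round-trip test. This is only a sketch, assuming `double` is an IEEE 754 binary64 type with a 53-bit significand and the default round-to-nearest mode; the function name `roundTripsExactly` is made up for illustration.

```cpp
#include <cstdint>

// Returns true when u survives a round trip through double unchanged.
bool roundTripsExactly(std::uint64_t u) {
    const double d = static_cast<double>(u);
    // Values close to 2^64 can round up to exactly 2^64, which no longer
    // fits in uint64_t; converting it back would be undefined behavior.
    if (d >= 18446744073709551616.0) {  // 2^64
        return false;
    }
    return static_cast<std::uint64_t>(d) == u;
}
```

Under those assumptions, 2^53 round-trips exactly, while 2^53 + 1 and the maximum uint64_t value do not.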

What is the most accurate way to find a floating point value between 0 and 1 that represents how far between 0 and the maximum possible an unsigned 64-bit value is?

asked Oct 24 '17 01:10

Thomas

2 Answers

What is the most accurate way to find a floating point value between 0 and 1 that represents how far between 0 and the maximum possible an unsigned 64-bit value is?

Mapping integer values in the range [0...2⁶⁴) to [0 ... 1.0) can be done directly:

1. Convert from uint64_t to double.

2. Scale by 2⁶⁴, that is, divide by 2⁶⁴ (@Mark Ransom).

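 #include <stdint.h>

 #define TWO63 0x8000000000000000u  // 2^63 as an unsigned integer constant
 #define TWO64f (TWO63*2.0)         // 2^64 as a double (2^64 does not fit in uint64_t)

 double map(uint64_t u) {
   double y = (double) u;  // rounds when u needs more than 53 significant bits
   return y/TWO64f;        // dividing by a power of two adds no further rounding
 }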

This will map

Integer values in the range [2⁶³...2⁶⁴) to [0.5 ... 1.0): 2⁵² different double values.
Integer values in the range [2⁶²...2⁶³) to [0.25 ... 0.5): 2⁵² different double values.
Integer values in the range [2⁶¹...2⁶²) to [0.125 ... 0.25): 2⁵² different double values.
...
Integer values in the range [2⁵²...2⁵³) to [2⁻¹² ... 2⁻¹¹): 2⁵² different double values.
Integer values in the range [0...2⁵²) to [0 ... 2⁻¹²): 2⁵² different double values.

Mapping integer values in the range [0...2⁶⁴) to [0 ... 1.0] is more difficult. (Note the ] vs. the ).)

[Feb 2021] I see this answer needs re-explanation on upper edge cases. Potential values returned include 1.0, because the largest uint64_t values round up to 2⁶⁴ when converted to double.
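The upper edge case can be seen in a short sketch. It assumes IEEE 754 binary64 doubles and the default round-to-nearest mode. `mapClosed` repeats the conversion from map() above under a different name, and `mapHalfOpen` is a hypothetical variant, not part of the original answer, that clamps the result just below 1.0.

```cpp
#include <cmath>
#include <cstdint>

// Same conversion as map() above: inputs near 2^64 round up during the
// uint64_t -> double conversion, so the result can be exactly 1.0.
double mapClosed(std::uint64_t u) {
    return static_cast<double>(u) / 18446744073709551616.0;  // 2^64
}

// Hypothetical variant: clamp so the result always stays in [0, 1).
double mapHalfOpen(std::uint64_t u) {
    const double y = mapClosed(u);
    const double belowOne = std::nextafter(1.0, 0.0);  // largest double < 1.0
    return y < 1.0 ? y : belowOne;
}
```

Whether clamping is acceptable depends on the use case, since it folds the topmost inputs onto the same output as their slightly smaller neighbors.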

answered Sep 20 '22 16:09

chux - Reinstate Monica

You can get a start from the following code for Java's java.util.Random nextDouble() method. It takes 53 bits and forms a double from them:

   return (((long)next(26) << 27) + next(27))
     / (double)(1L << 53);

I would use the most significant 26 bits of your long for the shifted value, and the next 27 bits to fill in the low order bits. That discards the least significant 64-53 = 11 bits of the input.
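In C++ the same idea might be sketched as follows, keeping the top 53 bits in one shift instead of combining a 26-bit and a 27-bit piece. The function name `top53ToUnit` is hypothetical, and the sketch assumes a `double` with a 53-bit significand.

```cpp
#include <cmath>
#include <cstdint>

// Keep the 53 most significant bits of u and scale them by 2^-53.
// The result lies in [0, 1) and never reaches 1.0.
double top53ToUnit(std::uint64_t u) {
    const std::uint64_t top = u >> 11;  // discards the low 64 - 53 = 11 bits
    // top < 2^53, so the conversion to double is exact, and ldexp only
    // adjusts the exponent.
    return std::ldexp(static_cast<double>(top), -53);
}
```

The trade-off is that every input below 2^11 maps to 0.0, which is the loss of very small values mentioned next.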

If distinguishing very small values is especially important you could also use subnormal numbers, which nextDouble() does not return.

answered Sep 17 '22 16:09

Patricia Shanahan

Tags: c++, floating-point, 64-bit
