
problems in floating point comparison [duplicate]

Tags:

c

floating-point

floating-point-conversion

#include <stdio.h>

int main(void)
{
    float f = 0.98;
    if (f <= 0.98)
        printf("hi");
    else
        printf("hello");
    getchar();   /* standard replacement for the non-portable getch() */
    return 0;
}

I am having a problem with this code. With different floating-point values of f I get different results. Why is this happening?


asked Oct 18 '10 19:10

Adi

2 Answers

f has float precision, but the literal 0.98 is a double by default, so the expression f <= 0.98 is evaluated in double precision.

f is therefore converted to a double for the comparison. The conversion cannot restore the precision that was lost when 0.98 was rounded to float, so the converted value may be slightly larger than the double 0.98.

Use

if(f <= 0.98f)

or use a double for f instead.

In detail, assuming float is IEEE single-precision and double is IEEE double-precision:

These kinds of floating-point numbers are stored in a base-2 representation. In base 2, 0.98 would need infinite precision to be represented exactly, because it is a repeating binary fraction:

0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...

A float can only store 24 significant bits, i.e.

       0.111110101110000101000111_101...
                                 ^ round off here
   =   0.111110101110000101001000

   =   16441672 / 2^24

   =   0.98000001907...

A double can store 53 significant bits, so

       0.11111010111000010100011110101110000101000111101011100_00101000...
                                                              ^ round off here
   =   0.11111010111000010100011110101110000101000111101011100

   =   8827055269646172 / 2^53

   =   0.97999999999999998224...

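So 0.98 becomes slightly larger when rounded to float and slightly smaller when rounded to double. The float value stays larger after it is converted to double, which is why f <= 0.98 is false.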

answered Nov 04 '22 14:11

kennytm

It's because floating point values are not exact representations of the number. All base ten numbers need to be represented on the computer as base 2 numbers. It's in this conversion that precision is lost.

Read more about this at http://en.wikipedia.org/wiki/Floating_point

An example (from encountering this problem in my VB6 days)

To convert the number 1.1 to a single precision floating point number we need to convert it to binary. There are 32 bits that need to be created.

Bit 1 is the sign bit (negative [1] or positive [0])
Bits 2-9 are for the exponent value
Bits 10-32 are for the mantissa (a.k.a. significand, basically the coefficient of scientific notation)

So for 1.1 the single-precision floating point value is stored as follows (this is the truncated value; the compiler may round the least significant bit behind the scenes, but all I do is truncate it, which is slightly less accurate but doesn't change the results of this example):

s --exp--- -------mantissa--------
0 01111111 00011001100110011001100

If you notice in the mantissa there is the repeating pattern 0011. 1/10 in binary is like 1/3 in decimal. It goes on forever. So to retrieve the values from the 32-bit single precision floating point value we must first convert the exponent and mantissa to decimal numbers so we can use them.

sign = 0 = a positive number

exponent: 01111111 = 127

mantissa: 00011001100110011001100 = 838860

With the mantissa we need to convert it to a decimal value. The reason is there is an implied integer ahead of the binary number (i.e. 1.00011001100110011001100). The implied number is because the mantissa represents a normalized value to be used in the scientific notation: 1.0001100110011.... * 2^(x-127).

To get the decimal value out of 838860 we simply divide by 2^23 (that is, multiply by 2^-23), as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 to the mantissa gives us 1.099999904632568359375. The exponent is 127, but the formula calls for 2^(x-127).

So here is the math:

(1 + 0.099999904632568359375) * 2^(127-127)

1.099999904632568359375 * 1 = 1.099999904632568359375

As you can see 1.1 is not really stored in the single floating point value as 1.1.

answered Nov 04 '22 15:11

Matt
