 

Golang floating point precision float32 vs float64

I wrote a program to demonstrate floating point error in Go:

package main

import "fmt"

func main() {
    a := float64(0.2)
    a += 0.1
    a -= 0.3

    var i int
    for i = 0; a < 1.0; i++ {
        a += a
    }
    fmt.Printf("After %d iterations, a = %e\n", i, a)
}

It prints:

After 54 iterations, a = 1.000000e+00 

This matches the behaviour of the same program written in C (using the double type).

However, if float32 is used instead, the program gets stuck in an infinite loop! If you modify the C program to use a float instead of a double, it prints

After 27 iterations, a = 1.600000e+00 

Why doesn't the Go program have the same output as the C program when using float32?
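
For illustration, here is a float32 variant of the same program, sketched with an iteration cap so it terminates instead of spinning forever as described above:

package main

import "fmt"

func main() {
    a := float32(0.2)
    a += 0.1
    a -= 0.3

    var i int
    // With float32 the loop condition never fails, so cap the
    // iteration count just to keep this sketch finite.
    for i = 0; a < 1.0 && i < 100; i++ {
        a += a
    }
    fmt.Printf("After %d iterations, a = %e\n", i, a)
}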

asked Mar 11 '14 by charliehorse55

People also ask

Should I use float32 or float64?

float32 is less accurate but faster than float64, and float64 is more accurate than float32 but consumes more memory. If accuracy is more important than speed, use float64; if speed is more important than accuracy, use float32.

What is the difference between float32 and float64 Golang?

Go has two floating point types: float32 and float64. float32 occupies 32 bits in memory and stores values in single-precision floating point format. float64 occupies 64 bits in memory and stores values in double-precision floating point format.

Is float 32-bit or 64-bit?

Floats generally come in two flavours: “single” and “double” precision. Single precision floats are 32-bits in length while “doubles” are 64-bits. Due to the finite size of floats, they cannot represent all of the real numbers - there are limitations on both their precision and range.

Are float and float32 the same?

float is the generic name for a numeric data type used to store decimal numbers; Go itself has no type literally named float. float32 is the Go floating point type that stores decimal values in 32 bits of data.
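
A small sketch of the precision difference these answers describe, printing the same value at both precisions (the digit counts in the comments are approximate):

package main

import "fmt"

func main() {
    // float32 keeps roughly 7 significant decimal digits,
    // float64 roughly 15-16.
    var f32 float32 = 1.0 / 3.0
    var f64 float64 = 1.0 / 3.0
    fmt.Printf("float32: %.20f\n", f32) // correct to about 7 digits
    fmt.Printf("float64: %.20f\n", f64) // correct to about 16 digits
}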


2 Answers

Using math.Float32bits and math.Float64bits, you can see how Go represents the different decimal values as IEEE 754 binary values:

Playground: https://play.golang.org/p/ZqzdCZLfvC
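
The playground program is presumably something along these lines, a sketch that prints the raw IEEE 754 bit patterns:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Print the single-precision bit patterns (32 bits, zero-padded).
    for _, v := range []float64{0.1, 0.2, 0.3} {
        fmt.Printf("float32(%v): %032b\n", v, math.Float32bits(float32(v)))
    }
    // Print the double-precision bit patterns (64 bits, zero-padded).
    for _, v := range []float64{0.1, 0.2, 0.3} {
        fmt.Printf("float64(%v): %064b\n", v, math.Float64bits(v))
    }
}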

Result:

float32(0.1): 00111101110011001100110011001101
float32(0.2): 00111110010011001100110011001101
float32(0.3): 00111110100110011001100110011010
float64(0.1): 0011111110111001100110011001100110011001100110011001100110011010
float64(0.2): 0011111111001001100110011001100110011001100110011001100110011010
float64(0.3): 0011111111010011001100110011001100110011001100110011001100110011

If you convert these binary representations to decimal values and do your loop, you can see that for float32, the initial value of a will be:

0.20000000298023224 + 0.10000000149011612 - 0.30000001192092896 = -7.4505806e-9 

a negative value that can never sum up to 1.

So, why does C behave differently?

If you look at the binary patterns (and know a little about how binary floating point values are represented), you can see that Go rounds the last bit, while I assume C just crops (truncates) it instead.

So, in a sense, while neither Go nor C can represent 0.1 exactly in a float, Go uses the value closest to 0.1:

Go:   00111101110011001100110011001101 => 0.10000000149011612
C(?): 00111101110011001100110011001100 => 0.09999999403953552
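
Both decimal values can be checked by feeding the bit patterns back through math.Float32frombits; a sketch:

package main

import (
    "fmt"
    "math"
    "strconv"
)

func main() {
    patterns := []string{
        "00111101110011001100110011001101", // last bit rounded (Go)
        "00111101110011001100110011001100", // last bit cropped
    }
    for _, p := range patterns {
        // Parse the 32-bit pattern and reinterpret it as a float32.
        bits, err := strconv.ParseUint(p, 2, 32)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s => %.17g\n", p, math.Float32frombits(uint32(bits)))
    }
}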

Edit:

I posted a question about how C handles float constants, and from the answer it seems that any implementation of the C standard is allowed to do either. The implementation you tried it with just did it differently than Go.

answered Oct 01 '22 by ANisus


Agree with ANisus, Go is doing the right thing. Concerning C, I'm not convinced by his guess.

The C standard does not dictate this, but most libc implementations will convert the decimal representation to the nearest float (at least to comply with IEEE 754-2008 or ISO 10967), so I don't think this is the most probable explanation.

There are several reasons why the C program's behavior might differ... In particular, some intermediate computations might be performed with excess precision (double or long double).

The most probable cause I can think of is that you wrote 0.1 instead of 0.1f in C.
In that case, you get excess precision in the initialization
(you sum float a + double 0.1 => the float is converted to double, then the result is converted back to float).

If I emulate these operations

float32(float32(float32(0.2) + float64(0.1)) - float64(0.3)) 

Then I find something near 1.1920929e-8f

After 27 iterations, this sums to 1.6f
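
A sketch of that emulation as a runnable Go program (the mixed float/double arithmetic below is my reading of the hypothesised C code, not the questioner's actual source):

package main

import "fmt"

func main() {
    // Emulate C's  float a = 0.2f; a += 0.1; a -= 0.3;  where 0.1 and 0.3
    // are double constants: each step promotes the float to double and
    // rounds the result back to float.
    a := float32(0.2)
    a = float32(float64(a) + 0.1)
    a = float32(float64(a) - 0.3)
    fmt.Println(a) // about 1.1920929e-08

    var i int
    for i = 0; a < 1.0; i++ {
        a += a
    }
    // After 27 iterations, a = 1.600000e+00, matching the C output above.
    fmt.Printf("After %d iterations, a = %e\n", i, a)
}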

answered Oct 01 '22 by aka.nice