 

Decompose floating-point number

Given a floating-point number, I would like to separate it into a sum of parts, each with a given number of bits. For example, given 3.1415926535 and told to separate it into base-10 parts of 4 digits each, it would return 3.141 + 5.926E-4 + 5.350E-8. Actually, I want to separate a double (which has 52 bits of precision) into three parts with 18 bits of precision each, but it was easier to explain with a base-10 example. I am not necessarily averse to tricks that use the internal representation of a standard double-precision IEEE float, but I would really prefer a solution that stays purely in the floating point realm so as to avoid any issues with endian-dependency or non-standard floating point representations.

No, this is not a homework problem, and, yes, this has a practical use. If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type. Starting from this kind of decomposition, then multiplying all the parts and convolving, is one way to do that. Yes, I could also use an arbitrary-precision floating-point library, but this approach is likely to be faster when only a few parts are involved, and it will definitely be lighter-weight.

David Wright asked Aug 27 '13


1 Answer

If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type.

Exactly. This technique can be found in Veltkamp/Dekker multiplication. While accessing the bits of the representation, as in other answers, is a possibility, you can also do it using only floating-point operations. There is one instance in this blog post; the part you are interested in is:

Input: f; coef is 1 + 2^N
 p = f * coef;
 q = f - p;
 h = p + q;  // h contains the 53-N highest bits of f
 l = f - h;  // l contains the N lowest bits of f

*, -, and + must be exactly the IEEE 754 operations at the precision of f for this to work. On Intel architectures, these operations are provided by the SSE2 instruction set. Visual C sets the precision of the historical x87 FPU to 53 bits in the prelude of the C programs it compiles, which also helps.

Pascal Cuoq answered Sep 28 '22