Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

Question

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).

Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.

But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:

Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?

Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.

Pascal Cuoq · Accepted Answer

I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
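As a rough illustration of where the three bits would come from, the sketch below re-biases the 11-bit exponent into 8 bits for values that already fit the normal range of a 1+8+52 format. The names `narrow_pack` and `narrow_unpack` are invented for this example. The sketch deliberately refuses values that would overflow, or would be subnormal, in the narrow format: rounding those correctly is what the article's technique is for, and it is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MANT_MASK ((UINT64_C(1) << 52) - 1)
#define REBIAS    896   /* 1023 - 127: difference between the two exponent biases */

/* Store d in 61 bits (1 sign + 8 exponent + 52 significand) if it fits exactly.
   Returns false for values outside the narrow format's normal range. */
static bool narrow_pack(double d, uint64_t *out)
{
    uint64_t r;
    memcpy(&r, &d, sizeof r);
    uint64_t sign = r >> 63;
    uint64_t e = (r >> 52) & 0x7FF;
    uint64_t m = r & MANT_MASK;
    uint64_t e8;

    if (e == 0x7FF)
        e8 = 0xFF;                      /* infinity or NaN */
    else if (e == 0 && m == 0)
        e8 = 0;                         /* signed zero */
    else if (e >= REBIAS + 1 && e <= REBIAS + 254)
        e8 = e - REBIAS;                /* normal in both formats */
    else
        return false;                   /* needs the article's rounding */

    *out = (sign << 60) | (e8 << 52) | m;
    return true;
}

/* Inverse of narrow_pack; only meant for values narrow_pack produced. */
static double narrow_unpack(uint64_t n)
{
    assert((n >> 61) == 0);             /* top three bits are free for a tag */
    uint64_t sign = (n >> 60) & 1;
    uint64_t e8 = (n >> 52) & 0xFF;
    uint64_t m = n & MANT_MASK;
    uint64_t e = (e8 == 0xFF) ? 0x7FF : (e8 == 0) ? 0 : e8 + REBIAS;
    uint64_t r = (sign << 63) | (e << 52) | m;
    double d;
    memcpy(&d, &r, sizeof d);
    return d;
}
```

In this layout the significand is untouched, which is why in-range results match ordinary double-precision results bit for bit.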

By contrast, naively borrowing any number of bits from the significand introduces many double-rounding problems, which become more frequent the fewer bits you remove from the 52-bit significand. Borrowing one bit from the significand, as you suggest in an edit to your question, would be the worst case: statistically, half the operations would produce double-rounded results that differ from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5 ULP, the basic operations would be accurate to ³/₄ ULP, a dramatic loss of accuracy that would derail many existing, finely designed numerical algorithms that expect 0.5 ULP.
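Double rounding can be reproduced on small integers. In the sketch below, `rne_shift` is a hypothetical helper that drops `k` low bits with round-to-nearest-even. Dropping two bits and then one more stands in for "round to double, then round to the shorter significand", while dropping three bits at once stands in for the ideal single rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Divide v by 2^k, rounding to nearest with ties going to even. */
static uint64_t rne_shift(uint64_t v, unsigned k)
{
    assert(k < 64);
    if (k == 0)
        return v;
    uint64_t q = v >> k;                          /* truncated quotient */
    uint64_t rem = v & ((UINT64_C(1) << k) - 1);  /* discarded bits */
    uint64_t half = UINT64_C(1) << (k - 1);       /* value of an exact tie */
    if (rem > half || (rem == half && (q & 1)))
        q++;
    return q;
}

/* Ideal behavior: a single rounding that drops all three bits. */
static uint64_t round_once(uint64_t v)
{
    return rne_shift(v, 3);
}

/* Double rounding: drop two bits, then round the result again. */
static uint64_t round_twice(uint64_t v)
{
    return rne_shift(rne_shift(v, 2), 1);
}
```

For `v = 11` (11/8 = 1.375), `round_once` gives 1 but `round_twice` gives 2: the first rounding turns 2.75 into 3, which then sits exactly on a tie and is rounded to the even neighbor.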

Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).

Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in the 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may be useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:

/* repr is a uint64_t holding the bit pattern of the double */
if ((repr & 7) == 4)
  repr += (repr & 8) >> 1;   /* midpoint case: round to even */
else
  repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */

Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.
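Putting the pieces together for addition might look like the sketch below. It is not code from the answer: the function names and `TAG_FLOAT` value are hypothetical, round-to-odd is simulated here with the classic TwoSum error-free transformation, and the NaN canonicalization in `box_double` is an added assumption (adding 4 to an arbitrary NaN payload could otherwise turn it into an infinity or a zero). It assumes strict IEEE double evaluation (no `-ffast-math`, no x87 excess precision) and ignores rare overflow corner cases inside TwoSum. Other operations would need their own way of detecting inexactness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAG_FLOAT UINT64_C(5)   /* hypothetical tag value for floats */
#define TAG_MASK  UINT64_C(7)

static uint64_t to_bits(double d)     { uint64_t r; memcpy(&r, &d, sizeof r); return r; }
static double   from_bits(uint64_t r) { double d; memcpy(&d, &r, sizeof d); return d; }

/* a + b rounded to odd in double precision (sketch). */
static double add_round_to_odd(double a, double b)
{
    double s = a + b;                       /* round-to-nearest sum */
    uint64_t r = to_bits(s);
    if (((r >> 52) & 0x7FF) == 0x7FF)       /* infinity or NaN: nothing to fix */
        return s;
    double bb = s - a;                      /* TwoSum: err = (a + b) - s exactly */
    double err = (a - (s - bb)) + (b - bb);
    if (err == 0.0 || (r & 1))              /* exact, or already odd */
        return s;
    /* s is even and inexact: the true sum lies between s and its neighbor
       on err's side, and that neighbor has an odd significand. */
    if ((err > 0.0) == (s > 0.0))
        r += 1;                             /* neighbor away from zero */
    else
        r -= 1;                             /* neighbor toward zero */
    return from_bits(r);
}

/* The answer's rounding to 49 explicit significand bits, ties to even. */
static uint64_t round_to_49(uint64_t repr)
{
    if ((repr & 7) == 4)
        repr += (repr & 8) >> 1;            /* midpoint case */
    else
        repr += 4;
    return repr & ~TAG_MASK;
}

static uint64_t box_double(double d)
{
    /* Assumption: replace any NaN with one canonical quiet NaN first. */
    uint64_t repr = (d != d) ? UINT64_C(0x7FF8000000000000) : to_bits(d);
    return round_to_49(repr) | TAG_FLOAT;
}

static double unbox_double(uint64_t v)
{
    assert((v & TAG_MASK) == TAG_FLOAT);
    return from_bits(v & ~TAG_MASK);
}

static uint64_t boxed_add(uint64_t x, uint64_t y)
{
    return box_double(add_round_to_odd(unbox_double(x), unbox_double(y)));
}
```

As a check of the idea, take 1 + 2^-50 plus 2^-60. The true sum is just above a midpoint of the 49-bit format, but a plain double addition lands exactly on the midpoint and would then round down to 1.0. With the round-to-odd step, the intermediate result keeps a sticky low bit and the final rounding goes up to 1 + 2^-49.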

Tags: math, floating-point, truncated

Jakub Wasilewski

1 Answers

Pascal Cuoq
