I'm using the Zarith library to do arbitrary-precision rational arithmetic. Suppose I have a rational number q
of type Q.t
that is the ratio of two big integers (Q
is Zarith's arbitrary-precision rational number module). Sometimes, I would like to print this number as a floating point number for readability, and sometimes I need to convert this number floating-point for later non-arbitrary-precision calculations. Is there a way to convert q
to a floating-point number up to a certain level of precision?
The way I am converting q
to floating-point now has no guarantees, and can create undefined floating-point numbers (Z
is the arbitrary-precision integer module):
let to_float q =
let n, d = num q, den q in
(* check if d is zero and raise an error if it is *)
let nf, df = Z.to_float n, Z.to_float d in
nf /. df
Is there a better way to handle this, where I can obtain a floating-point that most accurately approximates any q
?
Edit
I quickly wrote up Mark Dickinson's answer in OCaml, if anyone is interested. It can probably (definitely) be improved and cleaned up. I'll edit if I do so or if anyone has any advice for improvements. But for now this has solved my problem!
let to_float q =
let n, d = num q, den q in
let n_sign = Z.sign n in
let d_sign = Z.sign d in (* always >= 0 *)
if d_sign = 0 then raise Division_by_zero;
let n = Z.abs n in
if n_sign = 0 then 0. else
let shift = (Z.numbits n) - (Z.numbits d) - 55 in
let is_subnormal = shift < -1076 in
let shift = if is_subnormal then -1076 else shift in
let d = if shift >= 0 then Z.shift_left d shift else d in
let n = if shift < 0 then Z.shift_left n (-shift)
else n in
let quotient, remainder = Z.div_rem n d in
let quotient = if (Z.compare remainder (Z.zero)) = 0 && Z.is_even quotient then
Z.add Z.one quotient else quotient in
let quotient = if not is_subnormal then quotient else
let round_select = Z.to_int @@ Z.rem quotient @@ Z.of_int 8 in
Z.add quotient [|Z.zero;Z.minus_one;Z.of_int (-2);Z.one;Z.zero
;Z.minus_one;Z.of_int 2;Z.one|].(round_select)
in
let unsigned_res = ldexp (Z.to_float quotient) shift in
if n_sign = 1 then unsigned_res else -.unsigned_res
I'll look into writing an interface for GMP's mpq_get_d
function later, but I'm not entirely sure how to do that. The only way I see how to do that is to convert q : Q.t
to a string and pass that to:
int mpq_set_str (mpq_t rop, const char *str, int base)
Does anyone know how to then pass rop
to mpq_get_d
in OCaml or have a reference that describes how to do this? I looked through chapter 19 of RWO and didn't see a situation like this.
If you have access to
log2
operation, and then it's relatively easy to roll your own correctly rounded conversion. In a nutshell, the method looks like this:
n > 0
, d > 0
; filter out obvious underflow/overflowshift
so that 2^-shift*n/d
lies between 2^54
and 2^56
.x = 2^-shift*n/d
, rounded to the nearest integer using the round-to-odd rounding method.x
to the nearest IEEE 754 double-precision value dx
, with the usual round-ties-to-even rounding mode.ldexp(dx, shift)
.I'm afraid I'm not fluent in OCaml, but the following Python code illustrates the idea for positive inputs. I leave it to you to make the obvious modifications for negative inputs and for division by zero. You might also want to make an early return for cases of extreme overflow and underflow: those are easy to detect by looking for extra large or small values of shift
below.
from math import ldexp
def to_float(numerator, denominator):
"""
Convert numerator / denominator to float, correctly rounded.
For simplicity, assume both inputs are positive.
"""
# Shift satisfies 2**54 < (numerator / denominator) / 2**shift < 2**56
shift = numerator.bit_length() - denominator.bit_length() - 55
# Divide the fraction by 2**shift.
if shift >= 0:
denominator <<= shift
else:
numerator <<= -shift
# Convert to the nearest integer, using round-to-odd.
q, r = divmod(numerator, denominator)
if r != 0 and q % 2 == 0:
q += 1
# Now convert to the nearest float and shift back.
return ldexp(float(q), shift)
Some notes:
bit_length
method on a positive integer n
gives the number of bits required to represent n
, or in other words 1 + floor(log2(n))
.divmod
is a Python function that computes the quotient and remainder of an integer division at the same time.q
fits (easily) in a 64-bit integernumerator / denominator
to the nearest integer, and again when rounding that integer to a float. The first round uses the round-to-odd method; this ensures that the second round (implicit in the conversion from int to float) gives the same result as if we'd rounded the fraction directly to a float.ldexp
operation introduces potentially a third rounding. It's possible to deal with this, with some care. See below for some code.The above is in fact a simplified version of the algorithm that Python uses when dividing one (big) integer by another to get a floating-point result. You can see the source here. The comment at the beginning of the long_true_divide
function gives an overview of the method.
For completeness, here's a variant that also deals correctly with subnormal results.
def to_float(numerator, denominator):
"""
Convert numerator / denominator to float, correctly rounded.
For simplicity, assume both inputs are positive.
"""
# Choose shift so that 2**54 < numerator / denominator / 2**shift < 2**56
shift = numerator.bit_length() - denominator.bit_length() - 55
# The 'treat_as_subnormal' flag catches all cases of subnormal results,
# along with some cases where the result is not subnormal but *is* still
# smaller than 2**-1021. In all these cases, it's sufficient to find the
# closest integer multiple of 2**-1074. We first round to the nearest
# multiple of 2**-1076 using round-to-odd.
treat_as_subnormal = shift < -1076
if treat_as_subnormal:
shift = -1076
# Divide the fraction by 2**shift.
if shift >= 0:
denominator <<= shift
else:
numerator <<= -shift
# Convert to the nearest integer, using round-to-odd.
q, r = divmod(numerator, denominator)
if r != 0 and q % 2 == 0:
q += 1
# Now convert to the nearest float and shift back.
if treat_as_subnormal:
# Round to the nearest multiple of 4, rounding ties to
# the nearest multiple of 8. This avoids double rounding
# from the ldexp call below.
q += [0, -1, -2, 1, 0, -1, 2, 1][q%8]
return ldexp(float(q), shift)
This isn't a complete answer, but in looking around I found that Zarith uses GMP internally. There is a GMP function named mpq_get_d
that converts a rational to a double. If it's not available directly in Zarith, it should be possible (given some time) to add an interface for it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With