Rounding the result of division to the nearest integer is pretty simple. But I'm trying to round a result of division so that the subsequent operation will give the best approximation. That's best explained by a simple function:
const int halfbits = std::numeric_limits<unsigned int>::digits / 2;
unsigned x = foo(); // Likely big.
unsigned x_div = x >> halfbits; // Rounded down
unsigned y = x_div * x_div; // Will fit due to division.
I can round x_div
to nearest by adding 1<<(halfbits-1)
. But since x² is not a linear function, y in general is not rounded correctly. Is there a simple and more accurate way to calculate (x*x) >> (halfbits*2)
without using bigger types?
I think that adding 3<<(halfbits-3)
to x_div improves the rounding, but can't prove that's the best solution. Also, can this be generalized for xⁿ ?
Edit: by popular demand I take the liberty of "translating" the question in pure arithmetic terms (none of these C bit-shifting thing...).
Note: All divisions thereafter are integer divisons, for example 13/3 would be 4.
Problem: We can't compute x^2 because x is big, so instead we'd like to compute (x^2)/(2^N).
To do so we compute
x_div = x / sqrt(2^N)
which we then square:
y = x_div * x_div
However, this result is typically short of the exact value of (x^2)/(2^N)
and the OP suggests adding 0.5 * sqrt(2^N) or maybe 0.375 * sqrt(2^N) to better approximate the result...
As Oli Charlesworth
's answer suggests there's a much better way to get to the actual value, by thinking of x^2
as (x_hi + x_lo)^2.
Whereas with x_div
truncation will cause an error magnitude of at most 1, with x_div*x_div
the error could be up to 1<<(half_digits+2)
.
To see why, consider that we can express this squaring as follows (using long multplication):
x * x = (x_lo + x_hi) * (x_lo + x_hi)
= x_lo^2 + x_hi^2 + 2*x_lo*x_hi
where x_lo
and x_hi
are the lower and upper halves of x
, respectively. With some nice ASCII art, we can consider how these all line up:
MSB : : : LSB
+------+------+ : :
| x_hi^2 | : :
+-----++-----++-----+: :
: | 2*x_lo*x_hi |: :
: +------++-----++------+
: : | x_lo^2 |
: : +------+------+
:<----------->: : :
Our result
As we can see, the lower-order terms affect multiple bits in the final result.
However, each of these terms should fit into the original type without overflow. So with appropriate shifting and masking, we can get the desired result.
Of course, the compiler/hardware does this all for you if you use a bigger type, so you should just do that if you have the option.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With