While reading through Java 8's Integer class, I come upon the following FIX-ME: (line 379)
// TODO-FIXME: convert (x * 52429) into the equiv shift-add
// sequence.
The entire comment reads:
// I use the "[invariant division by multiplication][2]" trick to
// accelerate Integer.toString. In particular we want to
// avoid division by 10.
//
// The "trick" has roughly the same performance characteristics
// as the "classic" Integer.toString code on a non-JIT VM.
// The trick avoids .rem and .div calls but has a longer code
// path and is thus dominated by dispatch overhead. In the
// JIT case the dispatch overhead doesn't exist and the
// "trick" is considerably faster than the classic code.
//
// TODO-FIXME: convert (x * 52429) into the equiv shift-add
// sequence.
//
// RE: Division by Invariant Integers using Multiplication
// T Gralund, P Montgomery
// ACM PLDI 1994
//
I cannot imagine that I should be worried about this, as this has been present for quite a while.
But, can someone shed light on what this FIX-ME means and if has any side-effects?
Side notes:
'FIXME' is for things which are definitely broken, but where you want to not worry about it for the moment. 'TODO' is for useful features, optimizations or refactorings that might be worth doing in the future.
In-IDE. IDE extension that lets you fix coding issues before they exist!
FixMe.IT is a remote support software that enables users to connect to any remote computer. It caters to users from sole proprietors to global corporations across multiple sectors and industries.
52429 is the closest integer to (2 ^ 19) / 10, so division by 10 can be achieved by multiplying by 52429, and then dividing by 2 ^ 19, where the latter is a trivial bit shift operation instead of requiring a full division.
The code author appears to be suggesting that the multiplication could be done more optimally using shift/add operations instead, per this (C language) snippet:
uint32_t div10(uint16_t in)
{
// divides by multiplying by 52429 / (2 ^ 16)
// 52429 = 0xcccd
uint32_t x = in << 2; // multiply by 4 : total = 0x0004
x += (x << 1); // multiply by 3 : total = 0x000c
x += (x << 4); // multiply by 17 : total = 0x00cc
x += (x << 8); // multiply by 257 : total = 0xcccc
x += in; // one more makes : total = 0xcccd
return x >> 19;
}
What I can't answer is why they apparently thought this might be more optimal than a straight multiplication in a Java environment.
At the machine code level it would only be more optimal on a (nowadays rare) CPU without a hardware multiplier where the simplest (albeit perhaps naïve) multiply function would need 16 shift/add operations to multiply two 16-bit numbers.
On the other hand a hand-crafted function like the above can perform the multiplication by a constant in fewer steps by exploiting the numeric properties of that constant, in this case reducing it to four shift/add operations instead of 16.
FWIW (and somewhat impressively) the clang compiler on macOS even with just the -O1
optimisation flag actually converts that code above back into a single multiplication:
_div10: ## @div10
pushq %rbp
movq %rsp, %rbp
imull $52429, %edi, %eax ## imm = 0xCCCD
shrl $19, %eax
popq %rbp
retq
It also turns:
uint32_t div10(uint16_t in) {
return in / 10;
}
into exactly the same assembly code, which just goes to show that modern compilers really do know best.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With