I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second and I figured that removing it as much as possible would significantly increase the speed of my code, as of now one cycle of my mainloop does not run in 1/60 of a second as it should. I was wondering if it was possible to re-implement the modulo using only bit shifts like is possible with multiplication and division. So here is my code so far in c++ (if i can perform a modulo using assembly it would be even better). How can I remove the modulo without using division or multiplication?
while(input > 0)
{
out = (out << 3) + (out << 1);
out += input % 10;
input = (input >> 8) + (input >> 1);
}
EDIT: Actually I realized that I need to do it way more than 180 times per second. Seeing as the value of input can be a very large number up to 40 digits.
What you can do with simple bitwise operations is taking a power-of-two modulo(divisor) of the value(dividend) by AND'ing it with divisor-1. A few examples:
unsigned int val = 123; // initial value
unsigned int rem;
rem = val & 0x3; // remainder after value is divided by 4.
// Equivalent to 'val % 4'
rem = val % 5; // remainder after value is divided by 5.
// Because 5 isn't power of two, we can't simply AND it with 5-1(=4).
Why it works? Let's consider a bit pattern for the value 123 which is 1111011
and then the divisor 4, which has the bit pattern of 00000100
. As we know by now, the divisor has to be power-of-two(as 4 is) and we need to decrement it by one(from 4 to 3 in decimal) which yields us the bit pattern 00000011
. After we bitwise-AND both the original 123 and 3, the resulting bit pattern will be 00000011
. That turns out to be 3 in decimal. The reason why we need a power-of-two divisor is that once we decrement them by one, we get all the less significant bits set to 1
and the rest are 0
. Once we do the bitwise-AND, it 'cancels out' the more significant bits from the original value, and leaves us with simply the remainder of the original value divided by the divisor.
However, applying something specific like this for arbitrary divisors is not going to work unless you know your divisors beforehand(at compile time, and even then requires divisor-specific codepaths) - resolving it run-time is not feasible, especially not in your case where performance matters.
Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.
Actually division by constants is a well known optimization for compilers and in fact, gcc is already doing it.
This simple code snippet:
int mod(int val) {
return val % 10;
}
Generates the following code on my rather old gcc with -O3:
_mod:
push ebp
mov edx, 1717986919
mov ebp, esp
mov ecx, DWORD PTR [ebp+8]
pop ebp
mov eax, ecx
imul edx
mov eax, ecx
sar eax, 31
sar edx, 2
sub edx, eax
lea eax, [edx+edx*4]
mov edx, ecx
add eax, eax
sub edx, eax
mov eax, edx
ret
If you disregard the function epilogue/prologue, basically two muls (indeed on x86 we're lucky and can use lea for one) and some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
Now on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously a bit more ancient CPU is a question that can only be answered with benchmarking (and also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). Especially considering that it depends on an efficient mulhs (ie higher bits of a multiply instruction) to be efficient. Note that this code is not size independent - to be exact the magic number changes (and maybe also parts of the add/shifts), but that can be adapted.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With