
Re-implement modulo using bit shifts?

I'm writing some code for a very limited system where the mod operator is very slow. My code needs a modulo about 180 times per second, and I figured that removing it as much as possible would significantly speed things up; as of now, one cycle of my main loop does not run in 1/60 of a second as it should. I was wondering whether it is possible to re-implement the modulo using only bit shifts, as is possible with multiplication and division. Here is my code so far in C++ (if I can perform a modulo using assembly, even better). How can I remove the modulo without using division or multiplication?

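    while (input > 0)
    {
        out = (out << 3) + (out << 1);   // out * 10, written as out*8 + out*2
        out += input % 10;               // the modulo to be replaced

        // Presumably meant as input / 10 without dividing; note that this
        // shift sum is not actually equal to input / 10.
        input = (input >> 8) + (input >> 1);
    }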

EDIT: Actually, I realized that I need to do it far more than 180 times per second, since the value of input can be a very large number of up to 40 digits.

PgrAm asked Jun 18 '12 01:06


2 Answers

With simple bitwise operations you can take the remainder of a value (the dividend) modulo a power-of-two divisor by AND'ing the value with divisor - 1. A few examples:

unsigned int val = 123; // initial value
unsigned int rem;

rem = val & 0x3; // remainder after value is divided by 4. 
                 // Equivalent to 'val % 4'
rem = val % 5;   // remainder after value is divided by 5.
                 // Because 5 isn't a power of two, we can't simply AND the value with 5-1 (=4).

Why does it work? Consider the bit pattern for the value 123, which is 01111011, and the divisor 4, which has the bit pattern 00000100. The divisor has to be a power of two (as 4 is), and we decrement it by one (from 4 to 3 in decimal), which yields the bit pattern 00000011. After we bitwise-AND the original 123 with 3, the resulting bit pattern is 00000011, which is 3 in decimal. The reason we need a power-of-two divisor is that decrementing it by one sets all the less significant bits to 1 and leaves the rest 0. The bitwise AND then 'cancels out' the more significant bits of the original value and leaves just the remainder of the original value divided by the divisor.
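The masking argument above can be wrapped into a small helper. This is only a sketch: the names `is_power_of_two` and `mod_pow2` are made up for illustration, and it assumes unsigned operands, as in the example above.

```cpp
#include <cassert>

// True when d has exactly one bit set, i.e. d is a nonzero power of two.
// (Hypothetical helper name, not from the original answer.)
inline bool is_power_of_two(unsigned d) {
    return d != 0 && (d & (d - 1)) == 0;
}

// val % d for a power-of-two d, computed with a single AND.
// d - 1 has all bits below the single set bit of d set to 1,
// so the AND keeps exactly the bits that make up the remainder.
inline unsigned mod_pow2(unsigned val, unsigned d) {
    assert(is_power_of_two(d));  // the trick is only valid for powers of two
    return val & (d - 1);
}
```

For instance, `mod_pow2(123, 4)` masks 01111011 with 00000011 and yields 3, the same as `123 % 4`.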

However, a trick this specific does not carry over to arbitrary divisors unless you know the divisors beforehand (at compile time, and even then each divisor needs its own code path). Resolving it at run time is not feasible, especially in your case, where performance matters.
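For the divisor 10 from the question, such a divisor-specific code path might look like the sketch below. It is not part of the original answer: it uses a commonly cited shift-and-add approximation of n / 10 (a variant appears in Henry Warren's Hacker's Delight), assumes 32-bit unsigned values, and all function names are hypothetical. It also assumes the loop in the question is meant to compute input / 10 in its last line. A 40-digit input would need a multi-word version, which this sketch does not cover.

```cpp
#include <cassert>
#include <cstdint>

// n / 10 using only shifts, adds and one comparison (32-bit unsigned).
inline std::uint32_t div10_shift(std::uint32_t n) {
    std::uint32_t q = (n >> 1) + (n >> 2);  // roughly n * 0.75
    q += q >> 4;
    q += q >> 8;
    q += q >> 16;                           // roughly n * 0.8
    q >>= 3;                                // roughly n / 10, possibly one too small
    const std::uint32_t r = n - (((q << 2) + q) << 1);  // n - q * 10
    return q + (r > 9 ? 1u : 0u);           // correct the estimate if needed
}

// n % 10 derived from the quotient, again without * or /.
inline std::uint32_t mod10_shift(std::uint32_t n) {
    const std::uint32_t q = div10_shift(n);
    return n - (((q << 2) + q) << 1);
}

// The digit loop from the question, rewritten with the helpers above
// (it reverses the decimal digits of input).
inline std::uint32_t reverse_digits(std::uint32_t input) {
    std::uint32_t out = 0;
    while (input > 0) {
        const std::uint32_t q = div10_shift(input);
        out = (out << 3) + (out << 1);              // out * 10
        out += input - ((q << 3) + (q << 1));       // input % 10
        input = q;                                  // input / 10
    }
    return out;
}

// Spot-check against the built-in operators before relying on the helpers.
inline void check_div10_shift() {
    for (std::uint32_t n = 0; n < 100000; ++n) {
        assert(div10_shift(n) == n / 10);
        assert(mod10_shift(n) == n % 10);
    }
}
```

Whether this beats the native operator on a given target can only be settled by measuring; the sequence is about a dozen simple operations per call.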

There is also a previous question on the subject, which probably has interesting information on the matter from different points of view.

zxcdw answered Sep 20 '22 06:09


Actually, division by constants is a well-known compiler optimization, and in fact gcc already does it.

This simple code snippet:

int mod(int val) {
   return val % 10;
}

Generates the following code on my rather old gcc with -O3:

_mod:
        push    ebp
        mov     edx, 1717986919
        mov     ebp, esp
        mov     ecx, DWORD PTR [ebp+8]
        pop     ebp
        mov     eax, ecx
        imul    edx
        mov     eax, ecx
        sar     eax, 31
        sar     edx, 2
        sub     edx, eax
        lea     eax, [edx+edx*4]
        mov     edx, ecx
        add     eax, eax
        sub     edx, eax
        mov     eax, edx
        ret

If you disregard the function prologue/epilogue, that is basically two muls (on x86 we're lucky and can use lea for one of them) plus some shifts and adds/subs. I know I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
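Read back into C++, the instruction sequence in the listing corresponds roughly to the sketch below. The function names are made up for illustration. It assumes a 32-bit int and that right-shifting a negative signed value is an arithmetic shift (guaranteed since C++20, and the usual behaviour of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// val % 10 for a signed 32-bit value, following the gcc listing step by step.
// 1717986919 == 0x66666667, which is 2^34 / 10 rounded up.
inline std::int32_t mod10_magic(std::int32_t val) {
    // imul edx: 64-bit product, of which the listing keeps the high 32 bits (edx).
    const std::int64_t product = static_cast<std::int64_t>(val) * 1717986919LL;
    // High half followed by "sar edx, 2" is an arithmetic shift by 34 in total.
    std::int32_t quotient = static_cast<std::int32_t>(product >> 34);
    // "sar eax, 31" yields -1 for negative val and 0 otherwise;
    // subtracting it makes the quotient round toward zero, like C++ division.
    quotient -= (val >> 31);
    // "lea eax, [edx+edx*4]" and "add eax, eax" compute quotient * 10.
    return val - quotient * 10;
}

// Spot-check against the built-in operator, including negative inputs.
inline void check_mod10_magic() {
    for (std::int32_t v = -100000; v <= 100000; ++v) {
        assert(mod10_magic(v) == v % 10);
    }
}
```

The only real multiplication left is the widening one by the magic constant; the multiplication by 10 is what the compiler turned into lea plus add.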

On modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster on your obviously somewhat more ancient CPU can only be answered by benchmarking. Also make sure your compiler actually performs this optimization; otherwise you can always just "steal" the gcc version here ;). Keep in mind that the approach depends on an efficient mulhs (i.e. an instruction that yields the high bits of a multiplication). Note also that this code is not size independent: the magic number changes with the operand size (and maybe also parts of the adds/shifts), but that can be adapted.

Voo answered Sep 21 '22 06:09