AND faster than integer modulo operation?

Tags:

It is possible to re-express:

i % m

as:

i & (m-1)

where,

i is an unsigned integer
m is a power of 2

My question is: is the AND operation any faster? Don't modern CPUs support integer modulo in hardware in a single instruction? I'm interested in ARM, but don't see the modulo operation in its instruction set.

916

asked Oct 06 '11 16:10

user48956

2 Answers

You may be interested in Embedded Live: Embedded Programmers' Guide to ARM’s Cortex-M Architecture.

The ARM Cortex-M family has unsigned and singed division instructions, UDIV and SDIV, which take 2 to 12 cycles. There is no MOD instruction, but equivalent result is obtained by a {S,U}DIV followed by the multiply-and-subtract instruction MLS, which takes 2 cycles, for a total of 4-14 cycles.

The AND instruction is single cycle, therefore 4-14x faster.

124

answered Nov 08 '22 21:11

Adriano Mitre

It's more complicated than "single instruction" these days. Modern CPUs are complex beasts and need their instructions broken down into issue/execute/latency. It also usually depends on the width of the divide/modulo - how many bits are involved.

In any case, I'm not aware of 32 bit division being single cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex A8 and A9.

In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power a two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - depends how smart your optimizer is):

unsigned page_size_bits = 12;     // optimization works even without const here

unsigned foo(unsigned address) {
  unsigned page_size = 1U << page_size_bits;
  return address / page_size;
}

The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.

As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has hardware divide, it'll be faster to do it manually.

(Also, signed division has to truncate towards 0, and div / remainder have be able to produce negative numbers, so even x % 4 is more expensive than x & 3 for signed int x.)

answered Nov 08 '22 21:11

John Ripley

Related questions
                            
                                Can compilers generate self modifying code?
                            
                                Z80 ASM BNF structure... am I on the right track?
                            
                                g++ 4.6.1 compiler error: Error: unknown pseudo-op: `.cfi_personality'
                            
                                Is it possible to make a custom Interrupt in Assembly?
                            
                                How does GMP stores its integers, on an arbitrary number of bytes?
                            
                                mirror bits of a 32 bit word
                            
                                Measure CPU speed by counting assembly instructions
                            
                                what does "outb" in AT&T asm mean?
                            
                                Using a virtual machine to learn assembly
                            
                                How do I decompile a .hex file into C++ for Arduino?
                            
                                Difference between .equ and .word in ARM Assembly?
                            
                                Linux kernel assembly and logic
                            
                                With variable length instructions how does the computer know the length of the instruction being fetched? [duplicate]
                            
                                Programming Environment for a Motorola 68000 in Linux
                            
                                LOOP, LOOPE, LOOPNE?
                            
                                Equivalent for GCC's naked attribute
                            
                                How does argument passing work?
                            
                                Why is this inline assembly not working with a separate asm volatile statement for each instruction?
                            
                                How to make the kernel for my bootloader?
                            
                                Are programming languages and methods inefficient? (assembler and C knowledge needed)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

AND faster than integer modulo operation?

Tags:

cpu-architecture

assembly

micro-optimization

arm

user48956

People also ask

2 Answers

Adriano Mitre

John Ripley

Recent Activity

Donate For Us