Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AND faster than integer modulo operation?

It is possible to re-express:

  • i % m

as:

  • i & (m-1)

where,

  • i is an unsigned integer
  • m is a power of 2

My question is: is the AND operation any faster? Don't modern CPUs support integer modulo in hardware in a single instruction? I'm interested in ARM, but don't see the modulo operation in its instruction set.

like image 916
user48956 Avatar asked Oct 06 '11 16:10

user48956


People also ask

Is modulo faster than if?

A modulo-operation is very slow. An if is most likely to be faster than a modulo and more readable.

What is integer modulo operation?

In computing, the modulo operation returns the remainder or signed remainder of a division, after one number is divided by another (called the modulus of the operation).

Is modulo operator slower?

Thanks! Integer division and modulo are relatively slow because there is no direct hardware support (they compile to multiple instruction sequences). Floating point modulo is fast. Integer modulo is also slow on CPUs for the same reason.


2 Answers

You may be interested in Embedded Live: Embedded Programmers' Guide to ARM’s Cortex-M Architecture.

The ARM Cortex-M family has unsigned and singed division instructions, UDIV and SDIV, which take 2 to 12 cycles. There is no MOD instruction, but equivalent result is obtained by a {S,U}DIV followed by the multiply-and-subtract instruction MLS, which takes 2 cycles, for a total of 4-14 cycles.

The AND instruction is single cycle, therefore 4-14x faster.

like image 124
Adriano Mitre Avatar answered Nov 08 '22 21:11

Adriano Mitre


It's more complicated than "single instruction" these days. Modern CPUs are complex beasts and need their instructions broken down into issue/execute/latency. It also usually depends on the width of the divide/modulo - how many bits are involved.

In any case, I'm not aware of 32 bit division being single cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex A8 and A9.

In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power a two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - depends how smart your optimizer is):

unsigned page_size_bits = 12;     // optimization works even without const here

unsigned foo(unsigned address) {
  unsigned page_size = 1U << page_size_bits;
  return address / page_size;
}

The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.

As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has hardware divide, it'll be faster to do it manually.

(Also, signed division has to truncate towards 0, and div / remainder have be able to produce negative numbers, so even x % 4 is more expensive than x & 3 for signed int x.)

like image 30
John Ripley Avatar answered Nov 08 '22 21:11

John Ripley