Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

x86_64: is IMUL faster than 2x SHL + 2x ADD?

When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:

int64_t calc(int64_t a) {
  return (a << 6) + (a << 16) - a;
}

Assembly:

  imul        rdx,qword ptr [a],1003Fh  

So I was wondering if that is really faster than doing it the way it is written, something like:

  mov         rbx,qword ptr [a]  
  mov         rax,rbx  
  shl         rax,6  
  mov         rcx,rbx  
  shl         rcx,10h  
  add         rax,rcx  
  sub         rax,rbx  

I was always under the impression that multiplication is always slower than a few shifts/adds? Is that no longer the case with modern Intel x86_64 processors?

like image 421
rustyx Avatar asked Jun 20 '16 14:06

rustyx


People also ask

What is Imul instruction?

Description. The single-operand form of imul executes a signed multiply of a byte, word, or long by the contents of the AL, AX, or EAX register and stores the product in the AX, DX:AX or EDX:EAX register respectively.

How many operand S should be specified to use the Imul instruction?

The imul instruction has two basic formats: two-operand (first two syntax listings above) and three-operand (last two syntax listings above). The two-operand form multiplies its two operands together and stores the result in the first operand. The result (i.e. first) operand must be a register.

Where is the result of Imul stored?

If the operand is byte sized, it is multiplied by the byte in the AL register and the result is stored in the 16 bits of AX.


1 Answers

That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
imul r, r/m and imul r, r/m, imm are both 3 cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64bit operand size.

On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower times for 64-bit operand-size).

Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.


The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier. How modern X86 processors actually compute multiplications?).

Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.

Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64bit result from the regular 64bit multiplier into two 32bit halves.


Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.

gcc and clang3.8 and later use up to two LEA instructions instead of imul r, r/m, imm. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.

That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.

clang3.7 and earlier tends to favour imul except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)

e.g. this code on the Godbolt compiler explorer:

int foo (int a) { return a * 63; }
    # gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
    mov     eax, edi  # tmp91, a
    sal     eax, 6    # tmp91,
    sub     eax, edi  # tmp92, a
    ret

clang3.8 and later makes the same code.

like image 98
Peter Cordes Avatar answered Dec 04 '22 11:12

Peter Cordes