When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode, I saw that this 'hand-optimized' piece of C code was translated back into a multiplication:
int64_t calc(int64_t a) {
return (a << 6) + (a << 16) - a;
}
Assembly:
imul rdx,qword ptr [a],1003Fh
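(The constant checks out: 1003Fh = 65599 = 2^16 + 2^6 - 1, so (a << 6) + (a << 16) - a is exactly a * 1003Fh; the compiler folded the shift/add/sub sequence back into a single multiply by that constant.)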
So I was wondering if that is really faster than doing it the way it is written, something like:
mov rbx,qword ptr [a]
mov rax,rbx
shl rax,6
mov rcx,rbx
shl rcx,10h
add rax,rcx
sub rax,rbx
I was always under the impression that multiplication is slower than a few shifts/adds. Is that no longer the case with modern Intel x86_64 processors?
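A crude way to check would be a dependency-chain microbenchmark like the sketch below (the function names and iteration count are my own arbitrary choices, and a compiler may canonicalize both versions into the same imul anyway, so the generated assembly has to be inspected):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Both compute a * 65599; check the asm, since the compiler may emit the
 * same imul for both (as the MSVC output above shows). */
static int64_t calc_mul(int64_t a)   { return a * 65599; }
static int64_t calc_shift(int64_t a) { return (a << 6) + (a << 16) - a; }

int main(void) {
    volatile int64_t seed = 1;       /* unknown start value blocks constant folding */
    int64_t acc = seed;
    clock_t t0 = clock();
    for (long i = 0; i < 100000000L; i++)
        acc = calc_mul(acc);         /* serial dependency chain: measures latency */
    clock_t t1 = clock();
    for (long i = 0; i < 100000000L; i++)
        acc = calc_shift(acc);
    clock_t t2 = clock();
    printf("mul:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("shift: %.3f s (acc=%lld)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, (long long)acc);
    return 0;
}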
Description. The single-operand form of imul executes a signed multiply of a byte, word, or long by the contents of the AL, AX, or EAX register and stores the product in the AX, DX:AX or EDX:EAX register respectively.
The imul instruction has two basic formats: two-operand (first two syntax listings above) and three-operand (last two syntax listings above). The two-operand form multiplies its two operands together and stores the result in the first operand. The result (i.e. first) operand must be a register.
If the operand is byte sized, it is multiplied by the byte in the AL register and the result is stored in the 16 bits of AX.
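For reference, the MSVC output in the question, imul rdx, qword ptr [a], 1003Fh, is the three-operand (reg, r/m, imm) form. A toy illustration of the two-operand form (my own example; gcc -O2 for x86-64 emits the instructions shown in the comments):

int mul_reg (int a, int b) { return a * b; }
# mov   eax, edi
# imul  eax, esi        # two-operand form: eax = eax * esi
# ret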
That's right, modern x86 CPUs (especially Intel) have very high performance multipliers. imul r, r/m and imul r, r/m, imm are both 3 cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64-bit operand-size.
On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower numbers are for 64-bit operand-size.)
Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.
The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier. How modern X86 processors actually compute multiplications?).
Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). E.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.
Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64-bit result from the regular 64-bit multiplier into two 32-bit halves.
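As a sketch of tapping that full-width result from C (using GCC/Clang's unsigned __int128 extension; the helper name is my own, and mul is the unsigned counterpart of one-operand imul r64):

#include <stdint.h>

/* One widening 64x64 => 128-bit multiply; compiles to a single
 * one-operand mul, with the high half returned in rdx. */
uint64_t mulhi64(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;
    return (uint64_t)(p >> 64);   /* high 64 bits of the product */
}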
Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.
gcc and clang3.8 and later use up to two LEA instructions instead of imul r, r/m, imm. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.
That's a reasonable tuning choice, since a 3-instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.
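For instance, a multiply by 45 = 5*9 fits in two LEAs (my own example; the exact instruction order and registers vary by compiler version and tuning):

int mul45 (int a) { return a * 45; }
# lea   eax, [rdi+rdi*4]   # a*5
# lea   eax, [rax+rax*8]   # (a*5)*9 = a*45
# ret

Two dependent 1-cycle LEAs give 2c total latency versus 3c for the imul, at the cost of the extra uop, which is exactly the trade-off described above.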
clang3.7 and earlier tended to favour imul except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)
e.g. this code on the Godbolt compiler explorer:
int foo (int a) { return a * 63; }
# gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
mov eax, edi # tmp91, a
sal eax, 6 # tmp91,
sub eax, edi # tmp92, a
ret
clang3.8 and later generate the same code.