When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode, I saw that this 'hand-optimized' piece of C code was translated back into a multiplication:
int64_t calc(int64_t a) {
return (a << 6) + (a << 16) - a;
}
Assembly:
imul rdx,qword ptr [a],1003Fh
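(The constant checks out: 1003Fh = 65599 = 2^16 + 2^6 - 1, so (a << 6) + (a << 16) - a is exactly a * 1003Fh; the compiler folded the shift/add/sub sequence back into a single multiply by that constant.)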
So I was wondering if that is really faster than doing it the way it is written, something like:
mov rbx,qword ptr [a]
mov rax,rbx
shl rax,6
mov rcx,rbx
shl rcx,10h
add rax,rcx
sub rax,rbx
I was always under the impression that multiplication is slower than a few shifts/adds. Is that no longer the case with modern Intel x86_64 processors?
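A crude way to check would be a dependency-chain microbenchmark like the sketch below (the function names and iteration count are my own arbitrary choices, and a compiler may canonicalize both versions into the same imul anyway, so the generated assembly has to be inspected):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Both compute a * 65599; check the asm, since the compiler may emit the
 * same imul for both (as the MSVC output above shows). */
static int64_t calc_mul(int64_t a)   { return a * 65599; }
static int64_t calc_shift(int64_t a) { return (a << 6) + (a << 16) - a; }

int main(void) {
    volatile int64_t seed = 1;       /* unknown start value blocks constant folding */
    int64_t acc = seed;
    clock_t t0 = clock();
    for (long i = 0; i < 100000000L; i++)
        acc = calc_mul(acc);         /* serial dependency chain: measures latency */
    clock_t t1 = clock();
    for (long i = 0; i < 100000000L; i++)
        acc = calc_shift(acc);
    clock_t t2 = clock();
    printf("mul:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("shift: %.3f s (acc=%lld)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, (long long)acc);
    return 0;
}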
Description. The single-operand form of imul executes a signed multiply of a byte, word, or long by the contents of the AL, AX, or EAX register and stores the product in the AX, DX:AX or EDX:EAX register respectively.
The imul instruction has two basic formats: two-operand (first two syntax listings above) and three-operand (last two syntax listings above). The two-operand form multiplies its two operands together and stores the result in the first operand. The result (i.e. first) operand must be a register.
If the operand is byte sized, it is multiplied by the byte in the AL register and the result is stored in the 16 bits of AX.
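For reference, the MSVC output in the question, imul rdx, qword ptr [a], 1003Fh, is the three-operand (reg, r/m, imm) form. A toy illustration of the two-operand form (my own example; gcc -O2 for x86-64 emits the instructions shown in the comments):

int mul_reg (int a, int b) { return a * b; }
# mov   eax, edi
# imul  eax, esi        # two-operand form: eax = eax * esi
# ret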
That's right, modern x86 CPUs (especially Intel) have very high performance multipliers. imul r, r/m and imul r, r/m, imm are both 3 cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64-bit operand-size.
On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower numbers are for 64-bit operand-size.)
Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.
The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier. How modern X86 processors actually compute multiplications?).
Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). E.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.
Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64-bit result from the regular 64-bit multiplier into two 32-bit halves.
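As a sketch of tapping that full-width result from C (using GCC/Clang's unsigned __int128 extension; the helper name is my own, and mul is the unsigned counterpart of one-operand imul r64):

#include <stdint.h>

/* One widening 64x64 => 128-bit multiply; compiles to a single
 * one-operand mul, with the high half returned in rdx. */
uint64_t mulhi64(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;
    return (uint64_t)(p >> 64);   /* high 64 bits of the product */
}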
Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.
gcc and clang3.8 and later use up to two LEA instructions instead of imul r, r/m, imm. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.
That's a reasonable tuning choice, since a 3-instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.
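For instance, a multiply by 45 = 5*9 fits in two LEAs (my own example; the exact instruction order and registers vary by compiler version and tuning):

int mul45 (int a) { return a * 45; }
# lea   eax, [rdi+rdi*4]   # a*5
# lea   eax, [rax+rax*8]   # (a*5)*9 = a*45
# ret

Two dependent 1-cycle LEAs give 2c total latency versus 3c for the imul, at the cost of the extra uop, which is exactly the trade-off described above.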
clang3.7 and earlier tended to favour imul except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)
e.g. this code on the Godbolt compiler explorer:
int foo (int a) { return a * 63; }
# gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
mov eax, edi # tmp91, a
sal eax, 6 # tmp91,
sub eax, edi # tmp92, a
ret
clang3.8 and later generate the same code.