In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a NEG+ADD pair of instructions where other compilers would use a single SUB instruction.
As a simple example, consider the following C code:
uint64_t Mod3(uint64_t value) { return (value % 3); }
ICC translates this to the following machine code (regardless of optimization level):
mov  rcx, 0xaaaaaaaaaaaaaaab
mov  rax, rdi
mul  rcx
shr  rdx, 1
lea  rsi, QWORD PTR [rdx+rdx*2]
neg  rsi                          ; \ equivalent to:
add  rdi, rsi                     ; /   sub rdi, rsi
mov  rax, rdi
ret
Other compilers (including MSVC, GCC, and Clang) all generate essentially equivalent code, except that the NEG+ADD sequence is replaced by a single SUB instruction.
Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly of arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a very good optimizing compiler, developed by people who have insider information about their own microprocessors.
Could there be something Intel knows about the implementation of the SUB instruction on their processors that makes it more optimal to decompose it into NEG+ADD? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB is broken down internally into separate NEG and ADD µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.
Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization: SUB is exactly as efficient as ADD on all processors, so the additional NEG instruction just serves to slow things down.
I also ran the two sequences through Intel's own Architecture Code Analyzer (IACA) to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB appears to be superior in every respect from Nehalem through Broadwell. Here are the two reports generated by the tool for Haswell:
SUB
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV |  1  |  2  -  D |  3  -  D |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------
|   1    | 0.1      | 0.2 |          |          |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |          | 1.0 |          |          |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9      |     |          |          |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |          |     |          |          |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |          | 0.3 |          |          |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |          |     |          |          |     |     |     |     |    | mov rax, rcx

Total Num Of Uops: 7
NEG+ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV |  1  |  2  -  D |  3  -  D |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------
|   1    | 0.1      | 0.9 |          |          |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |          | 1.0 |          |          |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0      |     |          |          |     |     |     |     | CP | shr rdx, 0x1
|   1    |          |     |          |          |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |          | 0.1 |          |          |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1      |     |          |          |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |          |     |          |          |     |     |     |     |    | mov rax, rcx

Total Num Of Uops: 8
So, as far as I can tell, NEG+ADD increases the code size, increases the number of µops, increases pressure on the execution ports, and increases the number of cycles, resulting in a net decrease in throughput compared to SUB. So why is Intel's compiler doing this?
Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?
ADD is not always faster than INC, but it is almost always at least as fast (there are a few corner cases on certain older microarchitectures, but they are exceedingly rare), and it is sometimes significantly faster.
After all, both ADD and INC update the flags register. The only difference is that INC doesn't update CF.
Strangely enough, I have a simple answer: because ICC isn't optimal.
When you write your own compiler, you start with a very basic set of opcodes: NOP, MOV, ADD... perhaps ten in all. You don't bother with SUB for a while, because it can easily be replaced by ADDing the NEGated operand. NEG isn't fundamental either, as it can be replaced by XOR with all ones followed by ADD 1.
So you implement a rather complex bit-based encoding of operand types and sizes. You do it for a single machine instruction (e.g. ADD) and plan to reuse it for most other instructions. But by this time, your co-worker has finished an implementation of the remainder calculation that doesn't use SUB. Imagine: it's already called "Optimal_Mod", so you overlook the suboptimal sequence inside it, not because you're a bad guy who hates AMD, but simply because you see it's already labeled optimal, optimized.
The Intel compiler is pretty good in general, but it has a long version history, so it can behave strangely in some rare cases. I suggest you report this issue to Intel and see what happens.