
Why does Intel's compiler prefer NEG+ADD over SUB?

In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a pair of NEG+ADD instructions where other compilers would use a single SUB instruction.

As a simple example, consider the following C code:

uint64_t Mod3(uint64_t value) {
    return (value % 3);
}

ICC translates this to the following machine code (regardless of optimization level):

mov       rcx, 0xaaaaaaaaaaaaaaab
mov       rax, rdi
mul       rcx
shr       rdx, 1
lea       rsi, QWORD PTR [rdx+rdx*2]
neg       rsi                            ; \  equivalent to:
add       rdi, rsi                       ; /    sub  rdi, rsi
mov       rax, rdi
ret

Other compilers (including MSVC, GCC, and Clang) all generate essentially equivalent code, except that they replace the NEG+ADD sequence with a single SUB instruction.

Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly for arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler and it is developed by folks that have insider information about their microprocessors.

Could there be something that Intel knows about the implementation of the SUB instruction on their processors that makes it more optimal to decompose it into NEG+ADD instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB is broken down internally into individual NEG and ADD µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.

Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization. SUB is exactly as efficient as ADD on all processors, so the additional NEG instruction just serves to slow things down.

I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:

SUB
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.2 |           |           |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.3 |           |           |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx

Total Num Of Uops: 7
NEG+ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.9 |           |           |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0       |     |           |           |     |     |     |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.1 |           |           |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx

Total Num Of Uops: 8

So, as far as I can tell, NEG+ADD increases the code size, increases the number of µops, increases pressure on the execution ports, and increases the cycle count, resulting in a net decrease in throughput compared to SUB. So why is Intel's compiler doing this?

Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?

asked Jun 02 '17 by Cody Gray


1 Answer

Strangely, I have a simple answer: because ICC isn't optimal.

When you write your own compiler, you start with a very basic set of operation codes: NOP, MOV, ADD... maybe ten opcodes in all. You don't use SUB for a while, because it can easily be replaced by an ADD of the negated operand. NEG isn't strictly basic either, since it can be replaced by XOR with all-ones followed by ADD 1.

So you implement a rather complex bit-based encoding of operand types and sizes. You do it for a single machine instruction (e.g. ADD) and plan to reuse it for most other instructions. But by that time, your co-worker has finished an implementation of remainder calculation that doesn't use SUB at all! Imagine: it's already called "Optimal_Mod", so you overlook the suboptimal sequence inside it, not because you're a bad guy who hates AMD, but simply because it's already labeled optimal, optimized.

The Intel compiler is pretty good in general, but it has a long version history, so it can behave strangely in some rare cases. I suggest you report this issue to Intel and see what happens.

answered Oct 23 '22 by kay27