Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't clang use memory-destination x86 instructions when I compile with optimization disabled? Are they efficient?

I wrote this simple assembly code, ran it and looked at the memory location using GDB:

    .text

.global _main

_main:
    pushq   %rbp
    movl    $5, -4(%rbp)
    addl    $6, -4(%rbp)
    popq    %rbp
    ret

It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.

Now writing the same thing in C and compiling it to assembly turns out like this:

...  # clang output
    xorl    %eax, %eax
    movl    $0, -4(%rbp)
    movl    $5, -8(%rbp)
    movl    -8(%rbp), %ecx   # load a
    addl    $6, %ecx         # a += 6
    movl    %ecx, -8(%rbp)   # store a
....

It's moving them to a register before adding them together.

So why don't we add directly in memory?

Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?

Edit: Here is the C code for the second assembly block, I have disabled optimization when compiling.

#include <iostream>

int main(){
 int a = 5;
 a+=6; 
 return 0;
}
like image 628
Sam Avatar asked Jan 27 '19 18:01

Sam


People also ask

Is Clang better than GCC?

Clang is much faster and uses far less memory than GCC. Clang aims to provide extremely clear and concise diagnostics (error and warning messages), and includes support for expressive diagnostics. GCC's warnings are sometimes acceptable, but are often confusing and it does not support expressive diagnostics.

What is the difference between LLVM and Clang?

LLVM is a backend compiler meant to build compilers on top of it. It deals with optimizations and production of code adapted to the target architecture. CLang is a front end which parses C, C++ and Objective C code and translates it into a representation suitable for LLVM.

How do you compile with Clang?

2.4. To compile a C++ program on the command line, run the clang++ compiler as follows: $ scl enable llvm-toolset-6.0 'clang++ -o output_file source_file ...' This creates a binary file named output_file in the current working directory. If the -o option is omitted, the clang++ compiler creates a file named a.

Does Clang use GCC?

Clang is compatible with GCC. Its command-line interface shares many of GCC's flags and options. Clang implements many GNU language extensions and compiler intrinsics, some of which are purely for compatibility.


1 Answers

You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.

Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.

(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)

Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.


You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.

A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.

There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)

But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.

Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state. So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.

We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.

Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.


is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?

In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)

Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)

x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.

You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.

Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.


Avoiding slowdowns requires looking at more than 1 instruction at once

Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.

e.g. this code will cause a partial-register stall on Intel P6-family CPUs:

mov  ah, 1
add  eax, 123

Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.

But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:

  • Assembly - How to score a CPU instruction by latency and throughput
  • How many CPU cycles are needed for each assembly instruction?
  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).

But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.

GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)

main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $5, -4(%rbp)
    addl    $6, -4(%rbp)
    movl    $0, %eax
    popq    %rbp
    ret

How to see clang/LLVM using memory-destination add

I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)

void add_reg_to_mem(int *p, int b) {
    *p += b;
}

 # I used AT&T syntax because that's what you were using.  Intel-syntax is nicer IMO
    addl    %esi, (%rdi)
    ret

void add_imm_to_mem(int *p) {
    *p += 3;
}

  # gcc and clang -O3 both emit the same asm here, where there's only one good choice
    addl    $3, (%rdi)
    ret

The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.

    # gcc8.2 -O0 output
    ... after making a stack frame and spilling `p` from RDI to -8(%rbp)
    movq    -8(%rbp), %rax        # load p
    movl    (%rax), %eax          # load *p, clobbering p
    leal    3(%rax), %edx         # edx = *p + 3
    movq    -8(%rbp), %rax        # reload p
    movl    %edx, (%rax)          # store *p + 3

GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.

The clang -O0 output happens to be less horrible for this:

 # clang -O0
   ... after making a stack frame and spilling `p` from RDI to -8(%rbp)
    movq    -8(%rbp), %rdi    # reload p
    movl    (%rdi), %eax      # eax = *p
    addl    $3, %eax          # eax += 3
    movl    %eax, (%rdi)      # *p = eax

See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.


If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:

The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.

# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
    movl    4(%esp), %eax    # load p from the stack, because of the 32-bit calling convention

    movl    (%eax), %edx     # *p += 3 implemented as 3 separate instructions
    addl    $3, %edx
    movl    %edx, (%eax)
    ret

You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.

Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).

In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.

GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.

The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.

like image 76
Peter Cordes Avatar answered Oct 22 '22 17:10

Peter Cordes