
Are RMW instructions considered harmful on modern x86?

I recall that read-modify-write (RMW) instructions are generally to be avoided when optimizing x86 for speed. That is, you should avoid something like add [rsi], 10, which adds 10 to the memory location whose address is in rsi. The recommendation was usually to split it into a load-and-modify instruction followed by a store, so something like:

mov rax, 10
add rax, [rsp]
mov [rsp], rax

Alternately, you might use explicit load and stores and a reg-reg add operation:

mov rax, [rsp]
add rax, 10
mov [rsp], rax

Is this still reasonable advice (and was it ever?) for modern x86? [1]

Of course, in cases where a value from memory is used more than once, RMW is inappropriate, since you will incur redundant loads and stores. I'm interested in the case where a value is only used once.

Based on exploration in Godbolt, all of icc, clang and gcc prefer to use a single RMW instruction to compile something like:

void Foo::f() {
  x += 10;
}

into:

Foo::f():
    add     QWORD PTR [rdi], 10
    ret

So at least most compilers seem to think RMW is fine, when the value is only used once.
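To reproduce this comparison, one can write both forms at the source level and inspect the generated assembly (for example with -O2 -S, or in Godbolt). The following is only a sketch: the member type std::int64_t is an assumption chosen to match the QWORD PTR in the output above, and the names f_rmw and f_split are hypothetical. Both functions have identical semantics, so an optimizing compiler is free to emit the same instruction for each; which one it picks is exactly the codegen decision being asked about.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Foo in the question; the 64-bit member
// type is an assumption made to match `QWORD PTR` in the compiler output.
struct Foo {
    std::int64_t x = 0;

    // One source-level statement. The compilers tested in the question
    // turn this into a single `add QWORD PTR [rdi], 10`.
    void f_rmw() { x += 10; }

    // Explicit load, add, and store at the source level. This only
    // suggests the split form; the optimizer may fuse it back into one
    // RMW instruction, so check the emitted assembly rather than assume.
    void f_split() {
        std::int64_t tmp = x;  // load
        tmp += 10;             // modify (in a register, conceptually)
        x = tmp;               // store
    }
};
```

If both functions compile to the same add instruction, the compiler has in effect judged the RMW form to be at least as good as the split form for a value that is used once.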

Interestingly enough, the various compilers do not agree when the incremented value is a global, rather than a member, such as:

int global;

void g() {
  global += 10;
}

In this case, gcc and clang still use a single RMW instruction, while icc prefers a reg-reg add with explicit loads and stores:

g():
        mov       eax, DWORD PTR global[rip]                    #5.3
        add       eax, 10                                       #5.3
        mov       DWORD PTR global[rip], eax                    #5.3
        ret     

Perhaps it is something to do with RIP relative addressing and micro-fusion limitations? However, icc13 still does the same thing with -m32 so perhaps it's more to do with the addressing mode requiring a 32-bit displacement.


[1] I'm using the deliberately vague term modern x86 to basically mean the last few generations of Intel and AMD laptop/desktop/server chips.

BeeOnRope asked Jun 26 '16 01:06


1 Answer

Are RMW instructions considered harmful on modern x86?

No.

On modern x86/x64 the incoming instructions are decoded into micro-operations (uops).
Any RMW instruction will be broken down into a number of uops; in fact into the same uops that the separate load, modify and store instructions would be broken down into.

By using a 'complex' RMW instruction instead of separate 'simple' read, modify and write instructions you gain the following:

  1. Fewer instructions to decode.
  2. Better utilization of the instruction cache.
  3. Better utilization of the addressable registers, since no scratch register is needed for the intermediate value.

You can see this quite clearly in Agner Fog's instruction tables.

ADD [mem],const has a latency of 5 cycles.

MOV [mem],reg and vice versa have a latency of 2 cycles each, and an ADD reg,const has a latency of 1, for a total of 5.

I checked the timings for Intel Skylake, but AMD K10 is the same.

You need to take into account that compilers have to cater to many different processors and some compilers even use the same core logic for different processor families. This can lead to quite suboptimal strategies.

RIP relative addressing
On x64, RIP-relative addressing takes an extra cycle to resolve RIP on older processors.
Skylake does not have this delay, and I expect other designs will eliminate it as well.
I'm sure you're aware that 32-bit x86 does not support EIP-relative addressing; there you have to get at EIP in a roundabout fashion.

Johan answered Dec 21 '22 19:12