I recall that read-modify-write instructions are generally to be avoided when optimizing x86 for speed. That is, you should avoid something like add [rsi], 10, which adds 10 to the memory location whose address is stored in rsi. The recommendation was usually to split it into a read-modify instruction, followed by a store, so something like:
mov rax, 10
add rax, [rsp]
mov [rsp], rax
Alternately, you might use explicit load and stores and a reg-reg add operation:
mov rax, [rsp]
add rax, 10
mov [rsp], rax
Is this still reasonable advice (and was it ever?) for modern x86? [1]
Of course, in cases where a value from memory is used more than once, RMW is inappropriate, since you will incur redundant loads and stores. I'm interested in the case where a value is only used once.
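To make the distinction concrete, here is a minimal C++ sketch. The function names `bump_once` and `bump_and_reuse` are made up for illustration and do not come from any real codebase. In the first function the updated value is never needed again, so a single memory-destination add is a natural fit. In the second, the result is reused, so keeping it in a register avoids a second load. What a given compiler actually emits depends on its version and flags.

```cpp
// Hypothetical helpers for illustration only.

// The value is used exactly once: a compiler is free to implement this
// with a single memory-destination add.
void bump_once(long* p) {
    *p += 10;
}

// The updated value is needed again: load it once, modify it in a
// register, store it back, and reuse the register copy.
long bump_and_reuse(long* p) {
    long v = *p + 10;  // load + add
    *p = v;            // store the result
    return v * 3;      // reuse without reloading from memory
}
```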
Based on exploration in Godbolt, all of icc, clang and gcc prefer to use a single RMW instruction to compile something like:
void Foo::f() {
    x += 10;
}
into:
Foo::f():
    add QWORD PTR [rdi], 10
    ret
So at least most compilers seem to think RMW is fine, when the value is only used once.
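For anyone who wants to reproduce this in a compiler explorer, a self-contained version might look like the sketch below. The class definition is an assumption, since the question does not show it: a single 64-bit member `x` is chosen because the generated code uses `QWORD PTR`.

```cpp
#include <cstdint>

// Assumed definition: the original snippet does not show the class.
// A 64-bit member matches the QWORD PTR operand in the listing above.
struct Foo {
    std::int64_t x;
    void f();
};

// The member function from the question, unchanged.
void Foo::f() {
    x += 10;
}
```

Compiling with optimization enabled (for example `-O2`) and comparing the output across compilers is one way to check whether the single-instruction form still appears.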
Interestingly enough, the various compilers do not agree when the incremented value is a global, rather than a member, such as:
int global;
void g() {
    global += 10;
}
In this case, gcc and clang still use a single RMW instruction, while icc prefers a reg-reg add with explicit loads and stores:
g():
    mov eax, DWORD PTR global[rip] #5.3
    add eax, 10 #5.3
    mov DWORD PTR global[rip], eax #5.3
    ret
Perhaps it is something to do with RIP-relative addressing and micro-fusion limitations? However, icc13 still does the same thing with -m32, so perhaps it's more to do with the addressing mode requiring a 32-bit displacement.
[1] I'm using the deliberately vague term "modern x86" to basically mean the last few generations of Intel and AMD laptop/desktop/server chips.
Are RMW instructions considered harmful on modern x86?
No.
On modern x86/x64 the input instructions are translated into uops.
Any RMW instruction will be broken down into a number of uops; in fact into the same uops that separate instructions would be broken down into.
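By using a 'complex' RMW instruction instead of separate 'simple' read, modify and write instructions you gain fewer instructions to decode, better use of the instruction cache, and better use of the addressable registers, since no scratch register is needed.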
You can see that the cost is the same quite clearly in Agner Fog's instruction tables. ADD [mem],const has a latency of 5 cycles. MOV [mem],reg and vice versa have a latency of 2 cycles each, and ADD reg,const has a latency of 1, for a total of 5.
I checked the timings for Intel Skylake, but AMD K10 is the same.
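The arithmetic can be written out as a small compile-time check. The constants below are simply the latencies quoted in this answer, not measurements, and the names are made up for illustration.

```cpp
// Latencies in cycles as quoted in the answer (attributed there to
// Agner Fog's instruction tables); nothing is measured here.
constexpr int kLoadLatency  = 2;  // MOV reg,[mem]
constexpr int kAddLatency   = 1;  // ADD reg,const
constexpr int kStoreLatency = 2;  // MOV [mem],reg
constexpr int kRmwLatency   = 5;  // ADD [mem],const

// Latency of the dependent load -> add -> store chain.
constexpr int split_sequence_latency() {
    return kLoadLatency + kAddLatency + kStoreLatency;
}

// With the quoted numbers, the split sequence and the single RMW
// instruction have the same total latency.
static_assert(split_sequence_latency() == kRmwLatency,
              "split sequence should match the RMW latency");
```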
You need to take into account that compilers have to cater to many different processors and some compilers even use the same core logic for different processor families. This can lead to quite suboptimal strategies.
RIP relative addressing
On x64, RIP-relative addressing takes an extra cycle to resolve RIP on older processors.
Skylake does not have this delay and I'm sure others will eliminate the delay as well.
I'm sure you're aware that 32-bit x86 does not support EIP-relative addressing; there you have to do this in a roundabout fashion.