There's this related question: GCC: how is march different from mtune?
However, the existing answers don't go much further than the GCC manual itself. At most, we get:
If you use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated.
and
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on.
But exactly how does GCC favor one specific architecture when building, while the build can still run on other (usually older) architectures, albeit more slowly?
I only know of one thing (but I'm no computer scientist) that would be capable of this, and that's a CPU dispatcher. However, it doesn't seem (to me) that mtune is generating a dispatcher behind the scenes; some other mechanism is probably in effect.
I feel that way for two reasons:
mtune) and test for cpuid to detect supported instructions at runtime, instead of relying on a named architecture which is provided at build time.
So how does it work really?
-mtune doesn't create a dispatcher, and it doesn't need one: the set of instructions the compiler may use is already fixed at build time, and -mtune only changes how instructions from that set are chosen and ordered.
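For contrast, this is roughly what an explicit runtime dispatcher looks like; nothing of this kind is emitted because of -mtune. It is only a minimal sketch, assuming GCC or Clang on x86 for the __builtin_cpu_init and __builtin_cpu_supports builtins. The names sum4_generic, sum4_avx2 and pick_sum4 are hypothetical, and the "AVX2" kernel is a stand-in that shares the portable body.

```c
/* Hypothetical kernels: in a real dispatcher each one would be compiled for
   a different instruction set. Here both share the same portable body. */
float sum4_generic(const float *a) { return a[0] + a[1] + a[2] + a[3]; }
float sum4_avx2(const float *a)    { return a[0] + a[1] + a[2] + a[3]; }

typedef float (*sum4_fn)(const float *);

/* Runtime dispatch: query the CPU and return the matching kernel. */
sum4_fn pick_sum4(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                /* populate the CPU feature data */
    if (__builtin_cpu_supports("avx2"))
        return sum4_avx2;                /* only taken on AVX2-capable CPUs */
#endif
    return sum4_generic;                 /* safe fallback on any CPU */
}
```

A caller would store the result of pick_sum4() in a function pointer once and call through it. The decision happens at run time, which is exactly what -mtune does not do.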
From the GCC docs:
-mtune=cpu-type
Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions.
This means that GCC won't use instructions available only on cpu-type [1], but it will generate code that runs optimally on cpu-type.
To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture.
The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
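One way to observe that the instruction set is left alone is through the compiler's predefined macros. The sketch below uses a hypothetical helper, avx2_codegen_enabled, which reports whether the compiler was allowed to emit AVX2 for this translation unit. Assuming a typical default x86-64 GCC configuration, building with only -mtune=skylake should leave it returning 0, while -march=skylake (or -mavx2) makes it return 1.

```c
/* Returns 1 if the compiler was allowed to emit AVX2 instructions for this
   translation unit, 0 otherwise. GCC defines __AVX2__ only when that
   instruction set is enabled, e.g. by -march=skylake or -mavx2; -mtune on
   its own does not enable it. */
int avx2_codegen_enabled(void)
{
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}
```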
The micro-architecture is how the architecture is implemented in hardware.
For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation.
This can go as far as a code sequence being optimal on only one micro-architecture.
When generating machine code, GCC often has a degree of freedom in choosing how to order the instructions and which variant to use.
It uses heuristics to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that would penalise CPUs y, z and w.
When we use -mtune=x we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.
As a concrete example consider how this code is compiled:
    float bar(float a[4], float b[4])
    {
        for (int i = 0; i < 4; i++)
        {
            a[i] += b[i];
        }
        float r = 0;
        for (int i = 0; i < 4; i++)
        {
            r += a[i];
        }
        return r;
    }
The a[i] += b[i]; statement is vectorised (if the vectors don't overlap) differently when targeting a Skylake or a Core2:
Skylake
movups xmm0, XMMWORD PTR [rsi]
movups xmm2, XMMWORD PTR [rdi]
addps xmm0, xmm2
movups XMMWORD PTR [rdi], xmm0
movss xmm0, DWORD PTR [rdi]
Core2
pxor xmm0, xmm0
pxor xmm1, xmm1
movlps xmm0, QWORD PTR [rdi]
movlps xmm1, QWORD PTR [rsi]
movhps xmm1, QWORD PTR [rsi+8]
movhps xmm0, QWORD PTR [rdi+8]
addps xmm0, xmm1
movlps QWORD PTR [rdi], xmm0
movhps QWORD PTR [rdi+8], xmm0
movss xmm0, DWORD PTR [rdi]
The main difference is how an xmm register is loaded: on a Core2 it is loaded with two loads, using movlps and movhps, instead of a single movups.
The two-load approach is better on the Core2 micro-architecture: if you take a look at Agner Fog's instruction tables you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with 1 cycle of latency.
This is probably due to the fact that 128-bit accesses were split into two 64-bit accesses at the time.
On Skylake the opposite is true: movups performs better than two movXps.
So we have to pick one.
In general, GCC picks the first variant (the single movups) because Core2 is an old micro-architecture, but we can override this with -mtune.
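The same override can also be tried per function instead of per file. The sketch below assumes GCC on x86, whose target function attribute accepts a tune= option; on other compilers or targets the hypothetical TUNE macro expands to nothing. bar_core2 and bar_skylake are made-up names for two copies of bar from the example above. Both copies use the same instruction set and return the same result on any CPU the binary supports; only instruction selection and scheduling may differ, which can be inspected with something like -O3 -S -masm=intel.

```c
/* TUNE(opt) applies GCC's x86 target("tune=...") function attribute.
   Elsewhere it expands to nothing (an assumption made to keep the sketch
   portable). */
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__x86_64__) || defined(__i386__))
#define TUNE(opt) __attribute__((target(opt)))
#else
#define TUNE(opt)
#endif

/* Stamps out one copy of bar() under the given tuning option. */
#define DEFINE_BAR(name, opt)                       \
    TUNE(opt) float name(float a[4], float b[4])    \
    {                                               \
        for (int i = 0; i < 4; i++)                 \
            a[i] += b[i];                           \
        float r = 0;                                \
        for (int i = 0; i < 4; i++)                 \
            r += a[i];                              \
        return r;                                   \
    }

DEFINE_BAR(bar_core2,   "tune=core2")    /* tuned for Core2 */
DEFINE_BAR(bar_skylake, "tune=skylake")  /* tuned for Skylake */
```

Because no dispatcher is involved, the caller decides which copy to call; either one is safe to run on both CPUs.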
[1] The instruction set is selected with other switches, such as -march.