Consider the following code:
double x(double a, double b) {
    return a * (float)b;
}
It does a conversion from double to float, then back to double again, and multiplies.
When I compile it with gcc 9.1 with -O3 on x86-64, I get:
x(double, double):
movapd xmm2, xmm0
pxor xmm0, xmm0
cvtsd2ss xmm1, xmm1
cvtss2sd xmm0, xmm1
mulsd xmm0, xmm2
ret
With clang, and with older versions of gcc, I get this:
x(double, double):
cvtsd2ss xmm1, xmm1
cvtss2sd xmm1, xmm1
mulsd xmm0, xmm1
ret
Here there is no copy of xmm0 into xmm2; that copy seems unnecessary to me anyway.
With gcc 9.1 and -Os I get:
x(double, double):
movapd xmm2, xmm0
cvtsd2ss xmm1, xmm1
cvtss2sd xmm0, xmm1
mulsd xmm0, xmm2
ret
So it just removes the instruction which sets xmm0 to zero, but not the movapd.
I believe all three versions are correct, so could there be a performance benefit from the gcc 9.1 -O3 version? And if so, why? Does the pxor xmm0, xmm0 instruction have any benefit?
The issue is similar to Assembly code redundancy in optimized C code, but I don't think it's the same, because older versions of gcc do not generate the unnecessary copy.
This is a GCC missed optimization. It's unfortunately not rare for GCC in tiny functions, where its register allocator does a poor job with the hard-register constraints imposed by the calling convention; GCC apparently isn't usually this clumsy in the middle of larger functions.
The pxor-zeroing is there to break the (false) output dependency of cvtss2sd, which exists because of Intel's short-sighted design of having single-source scalar instructions leave the upper part of the destination vector unmodified. They started this with SSE1 for PIII, where it gave a short-term gain because PIII handled XMM registers as two 64-bit halves, so writing only one half let instructions like sqrtss be single-uop.
But they unfortunately kept this pattern even for SSE2 (new with Pentium 4), and later declined to fix it with the AVX versions of the SSE instructions. So compilers are stuck choosing between the risk of creating a long loop-carried dependency chain through a false dependency, and the cost of pxor-zeroing. GCC conservatively always uses pxor at -O3, omitting it at -Os. (2-source operations like mulsd already depend on the destination as an input, so zeroing is unnecessary for them.)
In this case, with its poor choice of register allocation, leaving out the pxor-zeroing would mean that converting (float)b back to double couldn't start until a was ready. So if the critical path was a being ready (with b ready early), omitting it would increase the latency from a to the result by 5 cycles on Skylake (the 2-uop cvtss2sd could only run after a was ready, because its output has to merge into the register that originally held a). Otherwise it's just the mulsd that has to wait for a, with all the stuff involving b done ahead of time.
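For illustration, assuming GCC kept the same register allocation but dropped the zeroing, the chain would look like this (a sketch, not actual compiler output):
movapd   xmm2, xmm0      ; copy a; xmm0 still holds a
cvtsd2ss xmm1, xmm1      ; (float)b, independent of a
cvtss2sd xmm0, xmm1      ; merges into xmm0, so it can't start until a is ready
mulsd    xmm0, xmm2      ; result now waits for the conversion, which waited for a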
foo same,same is another way to work around an output dependency; that's what clang is doing here. (And it's what GCC tries to do for popcnt, which unexpectedly has an output dependency on Sandybridge-family that's not architecturally required, unlike these stupid SSE ones.)
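As a sketch with made-up registers, the same idea applied to popcnt:
popcnt   rax, rdi      ; on Sandybridge-family, a false dependency on the old rax
popcnt   rdi, rdi      ; same,same: the only dependency is the real input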
BTW, AVX 3-operand instructions do sometimes provide a way to work around the false dependency, by using a "cold" register, or one that was xor-zeroed, as the register to merge into. That includes scalar int->FP conversion, although clang sometimes just uses movd plus a packed conversion for that.
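For example, a sketch of an AVX version of this function, using xmm3 as an arbitrarily chosen zeroed register to merge into (not actual compiler output):
vpxor     xmm3, xmm3, xmm3      ; a register known to be zero, cheap to keep around
vcvtsd2ss xmm1, xmm1, xmm1      ; (float)b; merge source is the real input, no false dep
vcvtss2sd xmm1, xmm3, xmm1      ; upper half merged from zeroed xmm3, not a stale value
vmulsd    xmm0, xmm0, xmm1      ; a * (double)(float)b, no movapd needed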
Related: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? (I should have just linked that, I forgot I already wrote this up in that much detail on Stack Overflow recently.)
The movapd and pxor zeroing don't cost any latency on modern CPUs, but nothing is ever free: they still cost a front-end uop and code size (L1i cache footprint). movapd has zero latency in the back-end and doesn't need an execution unit, but that's all - Can x86's MOV really be "free"? Why can't I reproduce this at all?