
Why this unnecessary MOVAPD copy in gcc 9.1, in a tiny function

Consider the following code:

double x(double a,double b) {
    return a*(float)b;
}

It converts b from double to float, then back to double, and multiplies.
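To make the two conversions explicit, here is a minimal sketch with each step written out. x_explicit is a hypothetical helper name, not part of my real code, and the instruction names in the comments assume x86-64 with SSE2 scalar math:

```c
/* The function from the question: b is narrowed to float, then widened
   back to double by the usual arithmetic conversions before the multiply. */
double x(double a, double b) {
    return a * (float)b;
}

/* Hypothetical helper (illustration only): the same computation with
   each conversion as a separate statement. */
double x_explicit(double a, double b) {
    float narrowed = (float)b;          /* rounds b to float precision (cvtsd2ss) */
    double widened = (double)narrowed;  /* exact widening, no rounding (cvtss2sd) */
    return a * widened;                 /* double-precision multiply (mulsd) */
}
```

Both functions should return the same value for any input. The result differs from a*b whenever b is not exactly representable as a float, for example b = 0.1.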

When I compile it with gcc 9.1 with -O3 on x86/64 I get:

x(double, double):
        movapd  xmm2, xmm0
        pxor    xmm0, xmm0
        cvtsd2ss        xmm1, xmm1
        cvtss2sd        xmm0, xmm1
        mulsd   xmm0, xmm2
        ret

With clang and older versions of gcc I get this:

x(double, double):
        cvtsd2ss        xmm1, xmm1
        cvtss2sd        xmm1, xmm1
        mulsd   xmm0, xmm1
        ret

Here xmm0 is not copied into xmm2, a copy that seems unnecessary to me.

With gcc 9.1 and -Os I get:

x(double, double):
        movapd  xmm2, xmm0
        cvtsd2ss        xmm1, xmm1
        cvtss2sd        xmm0, xmm1
        mulsd   xmm0, xmm2
        ret

So it just removes the instruction that sets xmm0 to zero, but not the movapd.

I believe all three versions are correct, so could there be a performance benefit from the gcc 9.1 -O3 version? And if yes, why? Does the pxor xmm0, xmm0 instruction have any benefit?

The issue is similar to Assembly code redundancy in optimized C code, but I don't think it's the same, because older versions of gcc do not generate the unnecessary copy.

Unlikus asked Jan 24 '23 20:01



1 Answer

This is a GCC missed optimization. Unfortunately that is not rare for GCC in tiny functions, where its register allocator does a poor job with the hard-register constraints imposed by the calling convention. GCC is apparently not usually this dumb between parts of larger functions.

The pxor-zeroing is there to break the (false) output dependency of cvtss2sd, which exists because of Intel's short-sighted design for single-source scalar instructions to leave the upper part of the destination vector unmodified. They started this with SSE1 for PIII, where it gave a short-term gain because PIII handled XMM regs as two 64-bit halves, so only writing one half let instructions like sqrtss be single-uop.

But they unfortunately kept this pattern even for SSE2 (new with Pentium 4), and later declined to fix it with the AVX versions of SSE instructions. So compilers are stuck choosing between the risk of creating a long loop-carried dependency chain through a false dependency, and the cost of using pxor-zeroing. GCC conservatively always uses pxor at -O3, omitting it at -Os. (2-source operations like mulsd already depend on the destination as an input, so pxor-zeroing is unnecessary for them.)
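A minimal sketch of that merge behaviour, using the SSE2 intrinsic _mm_cvtss_sd, which takes the register to merge into as an explicit first argument. The helper names are made up for illustration, and this assumes an x86 or x86-64 target with SSE2. What asm the compiler actually emits for it is up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */

/* Hypothetical helper: convert f to double and merge it into a destination
   vector that previously held {old_low, old_high}. */
static __m128d merge_convert(double old_low, double old_high, float f) {
    __m128d dst = _mm_set_pd(old_high, old_low);  /* arguments are (high, low) */
    return _mm_cvtss_sd(dst, _mm_set_ss(f));      /* low = (double)f, high kept from dst */
}

/* Low half of the result: the converted value. */
double cvtss2sd_low(double old_low, double old_high, float f) {
    return _mm_cvtsd_f64(merge_convert(old_low, old_high, f));
}

/* High half of the result: whatever the destination held before. */
double cvtss2sd_high(double old_low, double old_high, float f) {
    __m128d r = merge_convert(old_low, old_high, f);
    return _mm_cvtsd_f64(_mm_unpackhi_pd(r, r));
}

/* Merging into a freshly zeroed vector, the same idea as pxor-zeroing:
   the result no longer depends on any older register contents. */
double cvtss2sd_zeroed_high(float f) {
    __m128d r = _mm_cvtss_sd(_mm_setzero_pd(), _mm_set_ss(f));
    return _mm_cvtsd_f64(_mm_unpackhi_pd(r, r));
}
```

The high half surviving the conversion is exactly why the instruction has to wait for the old destination value: the hardware needs it as an input to produce the full 128-bit result.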

In this case, with its poor choice of register allocation, leaving out pxor-zeroing would mean that converting (float)b back to double couldn't start until a was ready. So if the critical path was a being ready (b ready early), omitting the pxor would add 5 cycles of latency from a to the result on Skylake, because the 2-uop cvtss2sd could run only after a was ready: its output has to merge into the register that originally held a. With the pxor, it's just the mulsd that has to wait for a, with all the stuff involving b done ahead of time.

foo same,same is another way to work around an output dependency; that's what clang is doing. (And what GCC tries to do for popcnt, which unexpectedly has one on Sandybridge-family that's not architecturally required, unlike these stupid SSE ones.)

BTW, AVX 3-operand instructions do sometimes provide a way to work around the false dependencies, using a "cold" register, or one that was xor-zeroed, as the register to merge into. Including for scalar int->FP, although clang sometimes just uses movd plus packed-conversion for that.

Related: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? (I should have just linked that, I forgot I already wrote this up in that much detail on Stack Overflow recently.)


The movapd and pxor-zeroing don't cost any latency on modern CPUs, but nothing is ever free. They still cost a front-end uop each, and code size (L1i cache footprint). movapd has zero latency in the back-end and doesn't need an execution unit, but that's all: see Can x86's MOV really be "free"? Why can't I reproduce this at all?

Peter Cordes answered Jan 29 '23 12:01
