Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why can't gcc vectorize this straight-line code?

I have the following code, which seems like a perfect candidate for SLP:

struct vector {
  double x, y, z;
} __attribute__((aligned(16)));

int
slp_test(struct vector *x0, struct vector *n)
{
  double t = -x0->z/n->z;
  double u = x0->x + t*n->x;
  double v = x0->y + t*n->y;
  return t >= 0.0 && u >= 0.0 && v >= 0.0 && u + v <= 1.0;
}

The computations of u and v seem easily vectorizable, and x0 and n ought to be aligned well enough for it. But on x86-64 at -O3, gcc 4.9.0 generates this code:

    movsd   .LC0(%rip), %xmm1
    movsd   16(%rdi), %xmm0
    movsd   (%rdi), %xmm2
    xorpd   %xmm1, %xmm0
    movsd   (%rsi), %xmm1
    pxor    %xmm3, %xmm3
    divsd   16(%rsi), %xmm0 ; t = x0->z/n->z
    mulsd   %xmm0, %xmm1    ; t*n->x
    addsd   %xmm1, %xmm2    ; u = x0->x + t*n->x
    movsd   8(%rsi), %xmm1
    mulsd   %xmm0, %xmm1    ; t*n->y
    ucomisd %xmm3, %xmm2
    addsd   8(%rdi), %xmm1  ; v = x0->y + t*n->y
    setae   %dl
    ucomisd %xmm3, %xmm1
    setae   %al
    testb   %al, %dl
    je      .L3
    ucomisd %xmm3, %xmm0
    jb      .L3
    addsd   %xmm2, %xmm1
    movsd   .LC2(%rip), %xmm0
    xorl    %eax, %eax
    ucomisd %xmm1, %xmm0
    setae   %al
    ret
.L3:
    xorl    %eax, %eax
    ret

How come gcc doesn't use mulpd addpd instead of the two mulsds and addsds? I used -fopt-info-all-vec to try to see why, and it complains about alignment (full output):

slp-test.c:8:17: note: === vect_analyze_data_refs_alignment ===
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref x0_3(D)->z
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref n_6(D)->z
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref x0_3(D)->x
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref n_6(D)->x
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 8 bytes of ref x0_3(D)->y
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 8 bytes of ref n_6(D)->y
slp-test.c:8:17: note: === vect_analyze_slp ===
slp-test.c:8:17: note: Failed to SLP the basic block.
slp-test.c:8:17: note: not vectorized: failed to find SLP opportunities in basic block.

Unless I've misinterpreted __attribute__((aligned(16))), it should be able to force alignment for those accesses. Any ideas?

like image 795
Tavian Barnes Avatar asked Jul 10 '14 16:07

Tavian Barnes


People also ask

What is GCC vectorization?

GCC Autovectorization flagsGCC is an advanced compiler, and with the optimization flags -O3 or -ftree-vectorize the compiler will search for loop vectorizations (remember to specify the -mavx flag too). The source code remains the same, but the compiled code by GCC is completely different.

What does it mean to vectorize your code?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

What does it mean to vectorize a loop?

• Loop vectorization transforms a program so that the. same operation is performed at the same time on several. vector elements. for (i=0; i<n; i++) c[i] = a[i] + b[i];


1 Answers

this code will not profit from vectorization much, keep in mind that cpus are able to execute multiple instructions in a single cycle. E.g on a Nehalem multiplication/addition has a latency of 4 and a reciprocal throughput of 1, so it should be able to compute 4 of these instructions in four cycles in the ideal case. Here at least 2 should be possible. This already means you would gain nothing by using packed instruction even if the vector registers were already filed perfectly.

EDIT: I did not realize the data can be loaded in one move, so the setup cost might be negligible

In order to fill them you probably already need a high and low mov instruction which will cost you more than the few packed instruction will gain you later. (On Nehalem mov[hl]pd have a latency of ~5 while movsd has a latency of 2)

the comparison cannot be vectorized profitably either as you would need to unpack the packed comparison back into a normal register which is a very expensive operation. Also the compiler does not know the probabilities of the branches, it must assume the first comparison will always short circuit the rest so doing anything in parallel will be harmful.

EDIT: with sse4 ptest it could be profitable though

The bottleneck here is also likely going to be the unvectorizable division. You are probably better of trying to vectorize the operations of 2 structures at once instead of trying to vectorize the operations in one structure.

like image 58
jtaylor Avatar answered Oct 19 '22 04:10

jtaylor