Why can't gcc vectorize this straight-line code?

Tags:

I have the following code, which seems like a perfect candidate for SLP:

struct vector {
  double x, y, z;
} __attribute__((aligned(16)));

int
slp_test(struct vector *x0, struct vector *n)
{
  double t = -x0->z/n->z;
  double u = x0->x + t*n->x;
  double v = x0->y + t*n->y;
  return t >= 0.0 && u >= 0.0 && v >= 0.0 && u + v <= 1.0;
}

The computations of u and v seem easily vectorizable, and x0 and n ought to be aligned well enough for it. But on x86-64 at -O3, gcc 4.9.0 generates this code:

    movsd   .LC0(%rip), %xmm1
    movsd   16(%rdi), %xmm0
    movsd   (%rdi), %xmm2
    xorpd   %xmm1, %xmm0
    movsd   (%rsi), %xmm1
    pxor    %xmm3, %xmm3
    divsd   16(%rsi), %xmm0 ; t = x0->z/n->z
    mulsd   %xmm0, %xmm1    ; t*n->x
    addsd   %xmm1, %xmm2    ; u = x0->x + t*n->x
    movsd   8(%rsi), %xmm1
    mulsd   %xmm0, %xmm1    ; t*n->y
    ucomisd %xmm3, %xmm2
    addsd   8(%rdi), %xmm1  ; v = x0->y + t*n->y
    setae   %dl
    ucomisd %xmm3, %xmm1
    setae   %al
    testb   %al, %dl
    je      .L3
    ucomisd %xmm3, %xmm0
    jb      .L3
    addsd   %xmm2, %xmm1
    movsd   .LC2(%rip), %xmm0
    xorl    %eax, %eax
    ucomisd %xmm1, %xmm0
    setae   %al
    ret
.L3:
    xorl    %eax, %eax
    ret

How come gcc doesn't use mulpd addpd instead of the two mulsds and addsds? I used -fopt-info-all-vec to try to see why, and it complains about alignment (full output):

slp-test.c:8:17: note: === vect_analyze_data_refs_alignment ===
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref x0_3(D)->z
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref n_6(D)->z
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref x0_3(D)->x
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 0 bytes of ref n_6(D)->x
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 8 bytes of ref x0_3(D)->y
slp-test.c:8:17: note: vect_compute_data_ref_alignment:
slp-test.c:8:17: note: misalign = 8 bytes of ref n_6(D)->y
slp-test.c:8:17: note: === vect_analyze_slp ===
slp-test.c:8:17: note: Failed to SLP the basic block.
slp-test.c:8:17: note: not vectorized: failed to find SLP opportunities in basic block.

Unless I've misinterpreted __attribute__((aligned(16))), it should be able to force alignment for those accesses. Any ideas?

795

asked Jul 10 '14 16:07

Tavian Barnes

1 Answers

this code will not profit from vectorization much, keep in mind that cpus are able to execute multiple instructions in a single cycle. E.g on a Nehalem multiplication/addition has a latency of 4 and a reciprocal throughput of 1, so it should be able to compute 4 of these instructions in four cycles in the ideal case. Here at least 2 should be possible. This already means you would gain nothing by using packed instruction even if the vector registers were already filed perfectly.

EDIT: I did not realize the data can be loaded in one move, so the setup cost might be negligible

In order to fill them you probably already need a high and low mov instruction which will cost you more than the few packed instruction will gain you later. (On Nehalem mov[hl]pd have a latency of ~5 while movsd has a latency of 2)

the comparison cannot be vectorized profitably either as you would need to unpack the packed comparison back into a normal register which is a very expensive operation. Also the compiler does not know the probabilities of the branches, it must assume the first comparison will always short circuit the rest so doing anything in parallel will be harmful.

EDIT: with sse4 ptest it could be profitable though

The bottleneck here is also likely going to be the unvectorizable division. You are probably better of trying to vectorize the operations of 2 structures at once instead of trying to vectorize the operations in one structure.

answered Oct 19 '22 04:10

jtaylor

Related questions
                            
                                Why not use GCC constructor/destructor?
                            
                                Incremental hashing in PHP vs C
                            
                                Pause execution until specific time in C
                            
                                C switch statement with non-compound statement use case? [duplicate]
                            
                                lack of speedup and erroneous results with OpenMP and Cython
                            
                                How to correctly check the size passed to malloc() calloc() realloc() as well as their return value?
                            
                                Best fit vs segregated fit vs buddy system for least fragmentation
                            
                                Unix domain sockets 100x slower on Solaris 10 than on Linux?
                            
                                Enforce Static Storage at Compile-time
                            
                                Difficulties with calling an Android NDK function from directly Delphi
                            
                                Difference between C and C++ static initialization [duplicate]
                            
                                Loading a dynamic C shared library using JNI which also loads another shared library
                            
                                Hiredis timeout with large number of concurrent requests
                            
                                Flow control in GDB based on local variable existence
                            
                                Best gcc optimization switches for hyperthreading
                            
                                Using select() with STDIN on windows? [duplicate]
                            
                                Marshalling C array in C# - Simple HelloWorld
                            
                                Indexing an array with a string (C)
                            
                                What are watch descriptors really ? (Linux inotify subsystem)
                            
                                Handling error codes from multiple libraries in C

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why can't gcc vectorize this straight-line code?

Tags:

c

vectorization

gcc

Tavian Barnes

People also ask

1 Answers

jtaylor

Recent Activity

Donate For Us