What's the difference between GCC builtin vectorization types and C arrays?

I have three functions a(), b() and c() that are supposed to do the same thing:

typedef float Builtin __attribute__ ((vector_size (16)));

typedef struct {
        float values[4];
} Struct;

typedef union {
        Builtin b;
        Struct s;
} Union;

extern void printv(Builtin);
extern void printv(Union);
extern void printv(Struct);

int a() {
        Builtin m = { 1.0, 2.0, 3.0, 4.0 };
        printv(m);
        return 0;
}

int b() {
        Union m = { 1.0, 2.0, 3.0, 4.0 };
        printv(m);
        return 0;
}

int c() {
        Struct m = { 1.0, 2.0, 3.0, 4.0 };
        printv(m);
        return 0;
}

When I compile this code I observe the following behaviour:

  • When calling printv() in a() all 4 floats are being passed by %xmm0. No writes to memory occur.
  • When calling printv() in b() 2 floats are being passed by %xmm0 and the two other floats by %xmm1. To accomplish this 4 floats are loaded (.LC0) to %xmm2 and from there to memory. After that, 2 floats are read from the same place in memory to %xmm0 and the 2 other floats are loaded (.LC1) to %xmm1.
  • I'm a bit lost on what c() actually does.

Why are a(), b() and c() different?

Here is the assembly output for a():

        vmovaps .LC0(%rip), %xmm0
        call    _Z6printvU8__vectorf

The assembly output for b():

        vmovaps .LC0(%rip), %xmm2
        vmovaps %xmm2, (%rsp)
        vmovq   .LC1(%rip), %xmm1
        vmovq   (%rsp), %xmm0
        call    _Z6printv5Union

And the assembly output for c():

         andq    $-32, %rsp
         subq    $32, %rsp
         vmovaps .LC0(%rip), %xmm0
         vmovaps %xmm0, (%rsp)
         vmovq   .LC2(%rip), %xmm0
         vmovq   8(%rsp), %xmm1
         call    _Z6printv6Struct

The data:

        .section        .rodata.cst16,"aM",@progbits,16
        .align 16
.LC0:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432
        .section        .rodata.cst8,"aM",@progbits,8
        .align 8
.LC1:
        .quad   4647714816524288000
        .align 8
.LC2:
        .quad   4611686019492741120

The quad 4647714816524288000 seems to be nothing more than the floats 3.0 and 4.0 in adjacent long words.
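
This reading can be checked by reinterpreting the constants as floats. The sketch below is only an illustration and not part of the compiler output; decode_quad, decode_long and FloatPair are hypothetical names, and it assumes a little-endian host with 32-bit IEEE 754 floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two floats packed into one 64-bit word; on little-endian x86 the float in
// the low 32 bits sits at the lower address.
struct FloatPair {
    float lo;
    float hi;
};

static_assert(sizeof(FloatPair) == sizeof(std::uint64_t),
              "two floats must fill exactly one quad");

// Reinterpret the bits of a .quad constant as two adjacent 32-bit floats.
FloatPair decode_quad(std::uint64_t quad) {
    FloatPair pair;
    std::memcpy(&pair, &quad, sizeof pair);  // memcpy avoids aliasing problems
    return pair;
}

// Reinterpret the bits of a .long constant as one 32-bit float.
float decode_long(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under those assumptions, decode_quad(4647714816524288000ULL) yields 3.0 and 4.0, the other quad yields 1.0 and 2.0, and the four .long values decode to 1.0, 2.0, 3.0 and 4.0.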

Nice question; I had to dig a little because I have never used SSE myself (strictly speaking, the v-prefixed instructions in your output are the VEX-encoded AVX forms of the SSE moves). Essentially, vector instructions operate on multiple values stored in one register, here the XMM registers (XMM0 to XMM15 in 64-bit mode, of which XMM0 to XMM7 are used for passing arguments). On x86-64 a C float is a 32-bit IEEE 754 value, so four floats form a 128-bit vector, which is exactly the width of an XMM register. The registers look like this:

SSE (128-bit):                         |----------------|name: XMM0; size: 128bit
AVX (256-bit):        |----------------|----------------|name: YMM0; size: 256bit
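
To make the size argument concrete, here is a small sketch that compares the vector type with a plain struct of four floats. It assumes GCC (or Clang) on x86-64; the add and lane helpers are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

typedef float Builtin __attribute__ ((vector_size (16)));

typedef struct {
        float values[4];
} Struct;

// Both types hold four 32-bit floats, i.e. 128 bits in total.
static_assert(sizeof(float) == 4, "assumes a 32-bit float");
static_assert(sizeof(Builtin) == 16, "vector type is 128 bits wide");
static_assert(sizeof(Struct) == 16, "struct of four floats is 128 bits wide");

// The vector type is aligned to its full width, the struct only to one float.
static_assert(alignof(Builtin) == 16, "vector is 16-byte aligned");
static_assert(alignof(Struct) == 4, "struct is 4-byte aligned");

// Element-wise addition works directly on the vector type (a GCC extension).
Builtin add(Builtin x, Builtin y) {
        return x + y;
}

// The plain struct needs an explicit loop over its array.
Struct add(Struct x, Struct y) {
        Struct r;
        for (int i = 0; i < 4; ++i)
                r.values[i] = x.values[i] + y.values[i];
        return r;
}

// Read one element of a vector; subscripting is also part of the extension.
float lane(Builtin v, int i) {
        return v[i];
}
```

So the two types have the same size and the same bit pattern for the same four floats; they differ in alignment, in the operations the compiler offers, and in how they are passed to functions.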

In your first case, a(), you use the GCC vector type

typedef float Builtin __attribute__ ((vector_size (16)));

which is passed as one 128-bit value, so the entire vector is loaded into the XMM0 register in one go. In your second case, b(), you use a union. Note that Union m = { 1.0, 2.0, 3.0, 4.0 }; does initialize the first member, m.b, so the constant .LC0 is still the complete vector; what changes is how the argument is passed. This leads to the following behavior:

The data from .LC0 is loaded into XMM2 with:

 vmovaps .LC0(%rip), %xmm2

but because the union can be read either as the vector or as the struct, it is no longer passed as a single vector. The x86-64 System V calling convention classifies an argument in 8-byte chunks, and the struct view makes both chunks independent SSE values. The 128 bits are therefore split into two 64-bit halves, and each half is passed in the low 64 bits of its own XMM register. This is done as follows.

The vector in XMM2 is stored to the stack with

    vmovaps %xmm2, (%rsp)

now the upper 64 bits of the vector (bits 64-127), i.e. the floats 3.0 and 4.0, are loaded from the constant .LC1 (vmovq moves a quadword, i.e. 64 bits) into XMM1 with

    vmovq   .LC1(%rip), %xmm1

and finally the lower 64 bits of the vector (bits 0-63), i.e. the floats 1.0 and 2.0, are loaded from the stack copy into XMM0 with

    vmovq   (%rsp), %xmm0

Now you have the lower half of the 128-bit vector in XMM0 and the upper half in XMM1.
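
One way to see why the vector travels in one register while the struct and the union are split is the argument classification in the x86-64 System V ABI, which assigns a class to every 8-byte chunk ("eightbyte") of an argument. The sketch below models only a simplified subset of those rules (the x87 classes and the post-merge cleanup are left out); ArgClass, Classification, classify_union and xmm_registers_used are hypothetical names for this illustration, not anything GCC exposes.

```cpp
#include <cassert>

// Simplified subset of the System V x86-64 argument classes.
enum class ArgClass { NoClass, Integer, Sse, SseUp, Memory };

// Merge rule for two fields that overlap the same eightbyte (x87 omitted).
ArgClass merge(ArgClass a, ArgClass b) {
        if (a == b) return a;
        if (a == ArgClass::NoClass) return b;
        if (b == ArgClass::NoClass) return a;
        if (a == ArgClass::Memory || b == ArgClass::Memory) return ArgClass::Memory;
        if (a == ArgClass::Integer || b == ArgClass::Integer) return ArgClass::Integer;
        return ArgClass::Sse;  // e.g. SSE merged with SSEUP gives SSE
}

// Classification of a 16-byte argument: one class per eightbyte.
struct Classification {
        ArgClass low;
        ArgClass high;
};

// A 128-bit vector: low eightbyte SSE, high eightbyte SSEUP.
Classification vector_class() { return { ArgClass::Sse, ArgClass::SseUp }; }

// A struct of four floats: each eightbyte holds two floats, so both are SSE.
Classification struct_class() { return { ArgClass::Sse, ArgClass::Sse }; }

// A union is classified by merging its members eightbyte by eightbyte.
Classification classify_union(Classification x, Classification y) {
        return { merge(x.low, y.low), merge(x.high, y.high) };
}

// Every SSE eightbyte starts a new XMM register; an SSEUP eightbyte rides
// along in the upper half of the register started before it.
int xmm_registers_used(Classification c) {
        int n = 0;
        if (c.low == ArgClass::Sse) ++n;
        if (c.high == ArgClass::Sse) ++n;
        return n;
}
```

In this model the vector needs one XMM register, while the struct and the union of both need two, which matches the three assembly listings.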

Now in case c() I'm not quite sure either, but here it goes. First %rsp is aligned down to a 32-byte boundary and then 32 bytes are subtracted to reserve stack space for the data (which keeps %rsp 32-byte aligned); this is done with

     andq    $-32, %rsp
     subq    $32, %rsp
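
The arithmetic behind these two instructions can be sketched as follows. This is only an illustration of the bit trick; align_down_32 and reserve_aligned_slot are hypothetical names, and the address used in the note below is made up.

```cpp
#include <cassert>
#include <cstdint>

// Round an address down to a multiple of 32, like `andq $-32, %rsp`.
// In two's complement, -32 has the same bit pattern as ~31: every bit is
// set except the lowest five, so the AND clears exactly those five bits.
std::uint64_t align_down_32(std::uint64_t address) {
        return address & ~std::uint64_t{31};
}

// Mimic the two instructions together: align, then reserve 32 bytes.
std::uint64_t reserve_aligned_slot(std::uint64_t rsp) {
        return align_down_32(rsp) - 32;
}
```

For a made-up stack pointer of 0x7ffc1234, the aligned value is 0x7ffc1220 and the reserved slot starts at 0x7ffc1200, which is still a multiple of 32.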

now this time the vector is loaded into XMM0 and then stored on the stack with

     vmovaps .LC0(%rip), %xmm0
     vmovaps %xmm0, (%rsp)

and finally the lower 64 bits (the floats 1.0 and 2.0) are loaded into XMM0 from the constant .LC2, and the upper 64 bits (the floats 3.0 and 4.0) are loaded into XMM1 from the stack copy at offset 8, with

     vmovq   .LC2(%rip), %xmm0
     vmovq   8(%rsp), %xmm1
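
What the 64-bit loads pick out of the stored vector can be mimicked in C++. This is a sketch assuming GCC or Clang on x86-64; Half and load_half are hypothetical names, and the byte offsets 0 and 8 correspond to (%rsp) and 8(%rsp) in the listings.

```cpp
#include <cassert>
#include <cstring>

typedef float Builtin __attribute__ ((vector_size (16)));

// Two floats, i.e. the 64 bits that a single vmovq transfers.
struct Half {
        float first;
        float second;
};

// Copy the 8 bytes at the given byte offset out of the vector, similar to a
// vmovq load from the copy of the vector that was stored on the stack.
Half load_half(Builtin v, int offset) {
        Half h;
        std::memcpy(&h, reinterpret_cast<const char*>(&v) + offset, sizeof h);
        return h;
}
```

With m = { 1.0, 2.0, 3.0, 4.0 }, offset 0 gives the pair 1.0 and 2.0 (the half passed in XMM0) and offset 8 gives 3.0 and 4.0 (the half passed in XMM1).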

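So a() passes the vector in a single register, while b() and c() both pass it as two 64-bit halves in XMM0 and XMM1; they differ only in which half the compiler loads from a constant and which from the stack. Hope this helps.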

