What's the difference between GCC builtin vectorization types and C arrays?

I have three functions a(), b() and c() that are supposed to do the same thing:

typedef float Builtin __attribute__ ((vector_size (16)));

typedef struct {
        float values[4];
} Struct;

typedef union {
        Builtin b;
        Struct s;
} Union;

extern void printv(Builtin);
extern void printv(Union);
extern void printv(Struct);

void a() {
        Builtin m = { 1.0, 2.0, 3.0, 4.0 };
        printv(m);
}

void b() {
        Union m = { 1.0, 2.0, 3.0, 4.0 };
        printv(m);
}

void c() {
        Struct m = { 1.0, 2.0, 3.0, 4.0 };
        printv(m);
}

When I compile this code I observe the following behaviour:

  • When calling printv() in a(), all 4 floats are passed in %xmm0. No writes to memory occur.
  • When calling printv() in b(), 2 floats are passed in %xmm0 and the other 2 floats in %xmm1. To accomplish this, all 4 floats are loaded (.LC0) into %xmm2 and stored from there to memory. After that, 2 floats are read back from that memory into %xmm0 and the other 2 floats are loaded (.LC1) into %xmm1.
  • I'm a bit lost on what c() actually does.

Why are a(), b() and c() different?

Here is the assembly output for a():

        vmovaps .LC0(%rip), %xmm0
        call    _Z6printvU8__vectorf

The assembly output for b():

        vmovaps .LC0(%rip), %xmm2
        vmovaps %xmm2, (%rsp)
        vmovq   .LC1(%rip), %xmm1
        vmovq   (%rsp), %xmm0
        call    _Z6printv5Union

And the assembly output for c():

         andq    $-32, %rsp
         subq    $32, %rsp
         vmovaps .LC0(%rip), %xmm0
         vmovaps %xmm0, (%rsp)
         vmovq   .LC2(%rip), %xmm0
         vmovq   8(%rsp), %xmm1
         call    _Z6printv6Struct

The data:

        .section        .rodata.cst16,"aM",@progbits,16
        .align 16
.LC0:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432
        .section        .rodata.cst8,"aM",@progbits,8
        .align 8
.LC1:
        .quad   4647714816524288000
        .align 8
.LC2:
        .quad   4611686019492741120

The quad 4647714816524288000 seems to be nothing more than the floats 3.0 and 4.0 in adjacent long words.

asked May 17 '13 by Tom De Caluwé

1 Answer

Nice question, I had to dig a little because I never used SSE (in this case SSE2/AVX) myself. Essentially, vector instructions operate on multiple values packed into one register, i.e. the XMM registers (XMM0–XMM15 on x86-64). In C the data type float uses the IEEE 754 single-precision format and is thus 32 bits wide. Four floats therefore form a vector of 128 bits, which is exactly the width of an XMM register. The registers look like this:

SSE / AVX-128:                         |----------------| name: XMM0; size: 128 bits
      AVX-256:        |----------------|----------------| name: YMM0; size: 256 bits

In your first case, a(), you use the SIMD vector type defined with

typedef float Builtin __attribute__ ((vector_size (16)));

which lets the compiler move the entire vector into the XMM0 register in one go: the x86-64 System V ABI passes a 16-byte vector type in a single SSE register. Now in your second case, b(), you use a union. Union m = { 1.0, 2.0, 3.0, 4.0 }; does initialize the vector member b, but the parameter type is now a union, and the ABI classifies a 16-byte union (or struct) built from floats as two 64-bit chunks ("eightbytes"), each passed in its own SSE register. This leads to the following behavior:

The data from .LC0 is loaded into XMM2 with:

 vmovaps .LC0(%rip), %xmm2

but because the argument can also be interpreted as a structure, the ABI splits the 16 bytes into two 64-bit eightbytes. Each eightbyte still travels in an XMM register, since it holds floating-point data, but each one may be at most 64 bits wide so that it fits its eightbyte slot. This is done in the following.

The vectorization in XMM2 is loaded to memory with

    vmovaps %xmm2, (%rsp)

now the upper 64 bits of the vector (bits 64–127), i.e. the floats 3.0 and 4.0, are moved (vmovq moves a quadword, i.e. 64 bits) to XMM1 with

    vmovq   .LC1(%rip), %xmm1

and finally the lower 64 bits of the vector (bits 0–63), i.e. the floats 1.0 and 2.0, are moved from memory to XMM0 with

    vmovq   (%rsp), %xmm0

Now you have the upper and the lower half of the 128-bit vector in two separate XMM registers.

Now in case c() I'm not quite sure as well, but here it goes. First %rsp is aligned down to a 32-byte boundary, and then 32 bytes are subtracted to reserve aligned space on the stack for the data. This is done with

     andq    $-32, %rsp
     subq    $32, %rsp

Now the vector is loaded into XMM0 and then placed on the stack with

     vmovaps .LC0(%rip), %xmm0
     vmovaps %xmm0, (%rsp)

and finally the lower 64 bits of the vector (the floats 1.0 and 2.0, stored as the constant .LC2) are loaded into XMM0, and the upper 64 bits (the floats 3.0 and 4.0) are read back from the stack into XMM1, with

     vmovq   .LC2(%rip), %xmm0
     vmovq   8(%rsp), %xmm1

In all three cases the same 16 bytes of data end up being passed, but the calling convention treats each type differently. Hope this helps.

answered Oct 15 '22 by red-E