I have three functions, a(), b() and c(), that are supposed to do the same thing:
typedef float Builtin __attribute__ ((vector_size (16)));
typedef struct {
float values[4];
} Struct;
typedef union {
Builtin b;
Struct s;
} Union;
extern void printv(Builtin);
extern void printv(Union);
extern void printv(Struct);
int a() {
Builtin m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
int b() {
Union m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
int c() {
Struct m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
When I compile this code I observe the following behaviour:
When calling printv() in a(), all 4 floats are passed in %xmm0. No writes to memory occur.
When calling printv() in b(), 2 floats are passed in %xmm0 and the other 2 floats in %xmm1. To accomplish this, 4 floats are loaded (.LC0) into %xmm2 and from there stored to memory. After that, 2 floats are read from the same place in memory into %xmm0 and the 2 other floats are loaded (.LC1) into %xmm1.
I am not sure what c() actually does.
Why are a(), b() and c() different?
Here is the assembly output for a():
vmovaps .LC0(%rip), %xmm0
call _Z6printvU8__vectorf
The assembly output for b():
vmovaps .LC0(%rip), %xmm2
vmovaps %xmm2, (%rsp)
vmovq .LC1(%rip), %xmm1
vmovq (%rsp), %xmm0
call _Z6printv5Union
And the assembly output for c():
andq $-32, %rsp
subq $32, %rsp
vmovaps .LC0(%rip), %xmm0
vmovaps %xmm0, (%rsp)
vmovq .LC2(%rip), %xmm0
vmovq 8(%rsp), %xmm1
call _Z6printv6Struct
The data:
.section .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long 1065353216
.long 1073741824
.long 1077936128
.long 1082130432
.section .rodata.cst8,"aM",@progbits,8
.align 8
.LC1:
.quad 4647714816524288000
.align 8
.LC2:
.quad 4611686019492741120
The quad 4647714816524288000 seems to be nothing more than the floats 3.0 and 4.0 in adjacent long words.
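That reading of the constant can be checked with a small sketch. The names FloatPair and decode_quad are hypothetical helpers introduced only for this illustration, and the code assumes a 32-bit IEEE 754 float, as on x86-64.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the two 32-bit floats packed into one .quad.
struct FloatPair {
    float low;   // bits 0-31 of the quad
    float high;  // bits 32-63 of the quad
};

// Split a 64-bit constant into its two 32-bit halves and reinterpret
// each half as a single-precision float (memcpy avoids aliasing issues).
FloatPair decode_quad(std::uint64_t quad) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    const std::uint32_t lo = static_cast<std::uint32_t>(quad);
    const std::uint32_t hi = static_cast<std::uint32_t>(quad >> 32);
    FloatPair pair;
    std::memcpy(&pair.low, &lo, sizeof lo);
    std::memcpy(&pair.high, &hi, sizeof hi);
    return pair;
}
```

Calling decode_quad(4647714816524288000ULL) gives the pair 3.0 and 4.0 (the .LC1 constant), and decode_quad(4611686019492741120ULL) gives 1.0 and 2.0 (the .LC2 constant).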
Nice question, I had to dig a little because I had never used SSE myself (strictly speaking, the v-prefixed instructions in your listing are the AVX encodings of the SSE operations). Essentially, vector instructions operate on multiple values stored in one register, here the XMM registers (XMM0-XMM15 in 64-bit mode). On this platform a float is a 32-bit IEEE 754 value, so four floats form a 128-bit vector, which is exactly the width of an XMM register. The registers look like this:
SSE / AVX-128: |----------------| name: XMM0; size: 128 bits
AVX-256:       |----------------|----------------| name: YMM0; size: 256 bits
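The size arithmetic can be confirmed at compile time. This is a minimal sketch that redeclares the three types from the question; size_in_bits is a hypothetical helper, and vector_size is a GCC/Clang extension, so other compilers are assumed not to accept it.

```cpp
#include <cstddef>

// The same three types as in the question (GCC/Clang vector extension).
typedef float Builtin __attribute__ ((vector_size (16)));
typedef struct {
    float values[4];
} Struct;
typedef union {
    Builtin b;
    Struct s;
} Union;

// Hypothetical helper: size of a type in bits, assuming 8-bit bytes.
template <typename T>
constexpr std::size_t size_in_bits() {
    return sizeof(T) * 8;
}

// All three types occupy 128 bits, the width of one XMM register.
static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
static_assert(size_in_bits<Builtin>() == 128, "vector type is 128 bits");
static_assert(size_in_bits<Struct>() == 128, "struct is 128 bits");
static_assert(size_in_bits<Union>() == 128, "union is 128 bits");
```

The three types have the same size and layout, so any difference in how they are passed comes from their type, not from their contents.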
In your first case, a(), you use a SIMD vector type declared with
typedef float Builtin __attribute__ ((vector_size (16)));
which lets the compiler pass the entire vector in one go in the XMM0 register. In your second case, b(), you use a union. Initializing it through the vector member (m.b) would not change anything, because how an argument is passed is decided by its type: the union can be read either as a vector or as a struct, so it is not passed as a single vector. This leads to the following behavior:
The data from .LC0 is loaded into XMM2 with:
vmovaps .LC0(%rip), %xmm2
Because the argument can be read either as a vector or as a structure, the calling convention (the System V x86-64 ABI here) does not pass it as one 128-bit vector. Instead it is split into two 64-bit chunks, and each chunk travels in the low half of its own XMM register, which is how a struct of four floats is passed. This is done as follows.
The vector in XMM2 is stored to the stack with
vmovaps %xmm2, (%rsp)
Next, the upper 64 bits of the vector (bits 64-127), i.e. the floats 3.0 and 4.0, are loaded into XMM1 directly from the constant .LC1 (vmovq moves a quadword, i.e. 64 bits) with
vmovq .LC1(%rip), %xmm1
and finally the lower 64 bits of the vector (bits 0-63), i.e. the floats 1.0 and 2.0, are loaded from the stack into XMM0 with
vmovq (%rsp), %xmm0
Now you have the upper and the lower part of the 128-bit vector in separate XMM registers.
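The split into two 64-bit chunks can be reproduced in plain code. The sketch below uses a hypothetical helper, split_eightbytes, and a hypothetical Halves type, and assumes a little-endian machine with 32-bit floats (true on x86-64). It only models the memory layout; it does not claim to show what the compiler does internally.

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for the struct from the question.
struct Struct {
    float values[4];
};

// Hypothetical helper type: the two 64-bit chunks of a 16-byte object.
struct Halves {
    std::uint64_t low;   // bytes 0-7:  values[0] and values[1]
    std::uint64_t high;  // bytes 8-15: values[2] and values[3]
};

// Copy the object out as two 8-byte chunks, mirroring how the argument
// ends up in the low halves of %xmm0 and %xmm1.
Halves split_eightbytes(const Struct& s) {
    static_assert(sizeof(Struct) == 16, "this sketch assumes a 16-byte struct");
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&s);
    Halves h;
    std::memcpy(&h.low, bytes, 8);
    std::memcpy(&h.high, bytes + 8, 8);
    return h;
}
```

For the values 1.0, 2.0, 3.0, 4.0 the low chunk equals 4611686019492741120 and the high chunk equals 4647714816524288000, which are exactly the .LC2 and .LC1 quads in the listing.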
For case c() I am not entirely sure either, but here it goes. First %rsp is aligned down to a 32-byte boundary, then 32 bytes are subtracted to reserve stack space for the data (which keeps %rsp 32-byte aligned). This is done with
andq $-32, %rsp
subq $32, %rsp
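The effect of these two instructions on the stack pointer can be modelled with ordinary integer arithmetic. The function names below are hypothetical and the address in the usage note is made up for illustration.

```cpp
#include <cstdint>

// Round an address down to a multiple of 32, as `andq $-32, %rsp` does.
// In two's complement, -32 has the same bit pattern as ~31.
constexpr std::uint64_t align_down_32(std::uint64_t addr) {
    return addr & ~std::uint64_t{31};
}

// Model of the two-instruction sequence: align, then reserve 32 bytes.
constexpr std::uint64_t reserve_aligned_32(std::uint64_t rsp) {
    return align_down_32(rsp) - 32;
}
```

For example, a stack pointer of 0x7ffc1234 is first rounded down to 0x7ffc1220 and then moved to 0x7ffc1200, which is still a multiple of 32.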
This time the vector is loaded into XMM0 and then placed on the stack with
vmovaps .LC0(%rip), %xmm0
vmovaps %xmm0, (%rsp)
and finally the lower 64 bits of the vector (1.0 and 2.0) are loaded into XMM0 from the constant .LC2, while the upper 64 bits (3.0 and 4.0) are loaded into XMM1 from the stack at 8(%rsp), with
vmovq .LC2(%rip), %xmm0
vmovq 8(%rsp), %xmm1
So a() passes the vector in a single register, while b() and c() both pass it as two 64-bit halves and differ only in how those halves are built. Hope this helps.