I am trying to understand how vectorization with SSE instructions works.
Here is a code snippet where vectorization is achieved:
#include <stdlib.h>
#include <stdio.h>
#define SIZE 10000
void test1(double * restrict a, double * restrict b)
{
int i;
double *x = __builtin_assume_aligned(a, 16);
double *y = __builtin_assume_aligned(b, 16);
for (i = 0; i < SIZE; i++)
{
x[i] += y[i];
}
}
and my compilation command:
gcc -std=c99 -c example1.c -O3 -S -o example1.s
Here is the resulting assembler output:
.file "example1.c"
.text
.p2align 4,,15
.globl test1
.type test1, @function
test1:
.LFB7:
.cfi_startproc
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movapd (%rdi,%rax), %xmm0
addpd (%rsi,%rax), %xmm0
movapd %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
rep ret
.cfi_endproc
.LFE7:
.size test1, .-test1
.ident "GCC: (Debian 4.8.2-16) 4.8.2"
.section .note.GNU-stack,"",@progbits
I practiced assembler many years ago, and I would like to know what the registers %rdi, %rax and %rsi represent in the code above.
I know %xmm0 is a SIMD register that can hold 2 doubles (16 bytes).
But I don't understand how the simultaneous addition is performed.
I think it all happens here:
movapd (%rdi,%rax), %xmm0
addpd (%rsi,%rax), %xmm0
movapd %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
rep ret
Does %rax represent the "x" array?
What does %rsi represent in the C code snippet?
Is the final result (for example a[0]=a[0]+b[0]) stored in %rdi?
Thanks for your help.
The first thing you need to know is the calling conventions for 64-bit code on Unix systems. See Wikipedia's x86-64_calling_conventions and for much more detail read Agner Fog's calling conventions manual.
Integer and pointer parameters are passed in the following order: rdi, rsi, rdx, rcx, r8, r9. So you can pass up to six integer values by register (but only four on Windows). This means in your case that:
rdi = &x[0],
rsi = &y[0].
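As a small illustration of the register order above, here is a sketch of a function taking six integer arguments. The function name sum6 is hypothetical, and the register comments assume the System V x86-64 convention used on Linux; the C code itself does not depend on it.

```c
/* Hypothetical example: under the System V x86-64 calling convention,
 * the six integer arguments arrive in registers in this order. */
long sum6(long a,  /* rdi */
          long b,  /* rsi */
          long c,  /* rdx */
          long d,  /* rcx */
          long e,  /* r8  */
          long f)  /* r9  */
{
    /* The integer return value is passed back in rax. */
    return a + b + c + d + e + f;
}
```

In test1 only the first two slots are used: the pointer a arrives in rdi and the pointer b in rsi.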
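The register rax starts at zero and is incremented by 2*sizeof(double)=16 bytes each iteration. It is a byte offset, not an element index: (%rdi,%rax) addresses the memory at rdi+rax. It is then compared with sizeof(double)*10000=80000 each iteration to test whether the loop is finished.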
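To make the roles of the registers concrete, here is a scalar C model of the loop, written as a sketch rather than as what the compiler actually runs. The function name test1_model and the local variables named after registers are illustrative assumptions; the two-element array xmm0 stands in for the 16-byte SIMD register, and the code assumes sizeof(double) is 8. On real hardware, addpd adds both lanes in a single instruction.

```c
#include <stddef.h>

#define SIZE 10000

/* Scalar model of the vectorized loop (hypothetical helper).
 * rdi and rsi are the base addresses, rax is a byte offset. */
void test1_model(double *a, double *b)
{
    char *rdi = (char *)a;   /* first argument:  &x[0] */
    char *rsi = (char *)b;   /* second argument: &y[0] */
    size_t rax = 0;          /* xorl %eax, %eax */

    do {
        double *px = (double *)(rdi + rax);
        const double *py = (const double *)(rsi + rax);
        double xmm0[2];

        /* movapd (%rdi,%rax), %xmm0 : load two doubles from x */
        xmm0[0] = px[0];
        xmm0[1] = px[1];

        /* addpd (%rsi,%rax), %xmm0 : add two doubles from y, lane by lane */
        xmm0[0] += py[0];
        xmm0[1] += py[1];

        /* movapd %xmm0, (%rdi,%rax) : store both results back into x */
        px[0] = xmm0[0];
        px[1] = xmm0[1];

        rax += 16;                            /* addq $16, %rax */
    } while (rax != SIZE * sizeof(double));   /* cmpq $80000, %rax ; jne .L3 */
}
```

So the result is not stored in rdi itself: rdi only holds the address of the array, and the sums are written to memory at rdi+rax.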
The use of cmp here is actually an inefficiency in the GCC compiler. Modern Intel processors can fuse the cmp and jne instructions into one micro-operation, and they can also fuse add and jne into one micro-operation, but they cannot fuse add, cmp, and jne into one micro-operation. But it's possible to remove the cmp instruction.
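What GCC should have done is set (all offsets are in bytes, so both pointers end up one past the end of their arrays)
rdi = (char *)&x[0] + 80000;
rsi = (char *)&y[0] + 80000;
rax = -80000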
and then the loop could be done like this
movapd (%rdi,%rax), %xmm0 # temp = x[i]
addpd (%rsi,%rax), %xmm0 # temp += y[i]
movapd %xmm0, (%rdi,%rax) # x[i] = temp
addq $16, %rax # i += 2 (sets the zero flag when rax reaches 0)
jnz .L3 # loop while rax != 0
Now the loop counts from -80000 up to 0 and does not need the cmp instruction, and the add and jnz will be fused into one micro-operation.
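The same idea can be sketched at the C level by indexing from the end of the arrays with a negative index that counts up to zero. The function name test1_countup is hypothetical, and whether a given compiler version actually emits the cmp-free loop for this source is an assumption to check in the generated assembly, not a guarantee.

```c
#include <stddef.h>

#define SIZE 10000

/* Sketch: count from -SIZE up to 0 so the loop test is "index != 0". */
void test1_countup(double * restrict a, double * restrict b)
{
    double *x_end = a + SIZE;          /* one past the last element of x */
    double *y_end = b + SIZE;          /* one past the last element of y */
    ptrdiff_t i = -(ptrdiff_t)SIZE;    /* starts at -10000 */

    do {
        x_end[i] += y_end[i];          /* x_end[-SIZE] is a[0], and so on */
        i++;
    } while (i != 0);                  /* the increment itself sets the zero flag */
}
```

The result is identical to the original test1; only the direction of the index changes.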