
Understanding of vectorization with SSE instructions

I am trying to understand how vectorization with SSE instructions works.

Here is a code snippet where vectorization is achieved:

#include <stdlib.h>
#include <stdio.h>

#define SIZE 10000

void test1(double * restrict a, double * restrict b)
{
  int i;

  double *x = __builtin_assume_aligned(a, 16);
  double *y = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
  {
    x[i] += y[i];
  }
}

and here is my compilation command:

gcc -std=c99 -c example1.c -O3 -S -o example1.s

Here is the assembler output:

  .file "example1.c"
  .text
  .p2align 4,,15
  .globl  test1
  .type test1, @function
test1:
.LFB7:
  .cfi_startproc
  xorl  %eax, %eax
  .p2align 4,,10
  .p2align 3
.L3:
  movapd  (%rdi,%rax), %xmm0
  addpd (%rsi,%rax), %xmm0
  movapd  %xmm0, (%rdi,%rax)
  addq  $16, %rax
  cmpq  $80000, %rax
  jne .L3
  rep ret
  .cfi_endproc
.LFE7:
  .size test1, .-test1
  .ident  "GCC: (Debian 4.8.2-16) 4.8.2"
  .section  .note.GNU-stack,"",@progbits

I practiced assembly many years ago, and I would like to know what the registers %rdi, %rax and %rsi above represent.

I know %xmm0 is a SIMD register that can hold 2 doubles (16 bytes).

But I don't understand how the simultaneous addition is performed.

I think it all happens here:

      movapd  (%rdi,%rax), %xmm0
      addpd (%rsi,%rax), %xmm0
      movapd  %xmm0, (%rdi,%rax)
      addq  $16, %rax
      cmpq  $80000, %rax
      jne .L3
      rep ret

Does %rax represent the "x" array?

What does %rsi represent in the C code snippet?

Is the final result (for example a[0] = a[0] + b[0]) stored into %rdi?

Thanks for your help.

youpilat13 asked Feb 12 '23 19:02

1 Answer

The first thing you need to know is the calling convention for 64-bit code on Unix systems (the System V AMD64 ABI). See Wikipedia's article on x86 calling conventions, and for much more detail read Agner Fog's calling conventions manual.

Integer and pointer parameters are passed in the following order: rdi, rsi, rdx, rcx, r8, r9. So you can pass up to six integer values by register (but only four on Windows). This means in your case that:

rdi = &x[0],
rsi = &y[0].
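As a small illustration of that register order, here is a sketch with a hypothetical function sum6 (not part of the question's code). The comments show where each argument would arrive under the System V x86-64 convention described above; on Windows only the first four would be passed in registers, and in different ones.

```c
/* Hypothetical example: six integer arguments, each annotated with the
   register it arrives in under the System V x86-64 calling convention. */
long sum6(long a,  /* rdi */
          long b,  /* rsi */
          long c,  /* rdx */
          long d,  /* rcx */
          long e,  /* r8  */
          long f)  /* r9  */
{
    return a + b + c + d + e + f;
}
```

In test1 the same rule applies to the two pointer arguments, which is why a arrives in rdi and b in rsi.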

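The register rax is a byte offset into both arrays, not the array itself. It starts at zero and is incremented by 2*sizeof(double) = 16 bytes each iteration. It is then compared with sizeof(double)*10000 = 80000 to test whether the loop is finished. The result is written back to memory at address rdi + rax by the second movapd, not into rdi.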
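The roles of the three registers can be sketched in C with SSE2 intrinsics. This is only an illustration, not what GCC does internally: the function name add_pairs and the variables xb, yb and off are made up for this sketch, and a plain scalar fallback is used when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */
#endif

/* Hypothetical helper: x[i] += y[i] for n doubles, two per iteration.
   Both pointers must be 16-byte aligned and n must be even. */
void add_pairs(double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);

    char *xb = (char *)x;              /* plays the role of rdi */
    const char *yb = (const char *)y;  /* plays the role of rsi */
    size_t end = n * sizeof(double);   /* 80000 when n == 10000 */

    for (size_t off = 0; off != end; off += 16) {  /* off plays the role of rax */
#if defined(__SSE2__)
        /* movapd (%rdi,%rax), %xmm0 : load two doubles from x */
        __m128d t = _mm_load_pd((const double *)(xb + off));
        /* addpd (%rsi,%rax), %xmm0 : add two doubles from y in one instruction */
        t = _mm_add_pd(t, _mm_load_pd((const double *)(yb + off)));
        /* movapd %xmm0, (%rdi,%rax) : store both sums back into x */
        _mm_store_pd((double *)(xb + off), t);
#else
        /* Scalar fallback with the same effect, one double at a time. */
        double *xp = (double *)(xb + off);
        const double *yp = (const double *)(yb + off);
        xp[0] += yp[0];
        xp[1] += yp[1];
#endif
    }
}
```

The "simultaneous" part is the single addpd (_mm_add_pd here), which adds both 8-byte halves of the 16-byte register at once.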

The use of cmp here is actually an inefficiency in GCC's output. Modern Intel processors can macro-fuse cmp and jne into a single micro-operation, and they can likewise fuse add and jne, but they cannot fuse add, cmp, and jne all together. It is possible, however, to remove the cmp instruction entirely.

What GCC should have done is set

rdi = &x[0] + 80000 bytes,
rsi = &y[0] + 80000 bytes,
rax = -80000

and then the loop could be done like this:

.L3:
movapd  (%rdi,%rax), %xmm0       # temp = x[i], x[i+1]
addpd (%rsi,%rax), %xmm0         # temp += y[i], y[i+1]
movapd  %xmm0, (%rdi,%rax)       # x[i], x[i+1] = temp
addq  $16, %rax                  # i += 2 (16 bytes)
jnz .L3                          # loop until rax reaches 0

Now the loop counts from -80000 up to 0 and does not need the cmp instruction, because addq itself sets the zero flag when rax reaches 0. The add and jnz will then be fused into one micro-operation.
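The same counting trick can be written in C, as a rough sketch. The names add_pairs_neg, x_end, y_end and off are invented here, and there is no guarantee that a given compiler turns this source into the add/jnz loop shown above; it only shows the indexing scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: x[i] += y[i] for n doubles (n even), indexed by a
   negative byte offset that counts up to zero. */
void add_pairs_neg(double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);

    ptrdiff_t bytes = (ptrdiff_t)(n * sizeof(double));  /* 80000 for n == 10000 */
    char *x_end = (char *)x + bytes;                    /* rdi: one past the end of x */
    const char *y_end = (const char *)y + bytes;        /* rsi: one past the end of y */

    /* off plays the role of rax: it starts at -bytes and the loop ends when
       the increment brings it to exactly zero, so no separate compare
       against an upper bound is needed. */
    for (ptrdiff_t off = -bytes; off != 0; off += 16) {
        double *xp = (double *)(x_end + off);
        const double *yp = (const double *)(y_end + off);
        xp[0] += yp[0];
        xp[1] += yp[1];
    }
}
```

Because the end pointers are fixed and only the offset changes, the elements are still visited in increasing address order, exactly as in the original loop.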

Z boson answered Feb 16 '23 03:02
