This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array and stores it in a forth array, all variables floating point double precision typed. I made two versions of that code: one with SSE instructions, using calls to and another one without them I then compiled them with gcc and -O0 optimization level. I write them below: <pre class="prettyprint"><code>// SSE VERSION #define N 10000 #define NTIMES 100000 #include <time.h> #include <stdio.h> #include <xmmintrin.h> #include <pmmintrin.h> double a[N] __attribute__((aligned(16))); double b[N] __attribute__((aligned(16))); double c[N] __attribute__((aligned(16))); double r[N] __attribute__((aligned(16))); int main(void){ int i, times; for( times = 0; times < NTIMES; times++ ){ for( i = 0; i <N; i+= 2){ __m128d mm_a = _mm_load_pd( &a[i] ); _mm_prefetch( &a[i+4], _MM_HINT_T0 ); __m128d mm_b = _mm_load_pd( &b[i] ); _mm_prefetch( &b[i+4] , _MM_HINT_T0 ); __m128d mm_c = _mm_load_pd( &c[i] ); _mm_prefetch( &c[i+4] , _MM_HINT_T0 ); __m128d mm_r; mm_r = _mm_add_pd( mm_a, mm_b ); mm_a = _mm_mul_pd( mm_r , mm_c ); _mm_store_pd( &r[i], mm_a ); } } } //NO SSE VERSION //same definitions as before int main(void){ int i, times; for( times = 0; times < NTIMES; times++ ){ for( i = 0; i < N; i++ ){ r[i] = (a[i]+b[i])*c[i]; } } } </code></pre> When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE intstructions, if not specifically given the -mno-sse (and others) options. I inspected the assembly code generated for the second code and I noticed that it makes use of movsd, addsd and mulsd instructions. So it makes use of SSE instructions but only of those that use the lowest part of the registers, if I am not wrong. The assembly code generated for the first C code made use, as expected, of the addp and mulpd instructions, though a pretty larger assembly code was generated. Anyway, the first code should get better profit, as far as I know, of SIMD paradigm, since every iteration two result values are computed. Still that, the second code performs something such as a 25 per cent faster than the first one. I also made a test with single precision values and get similar results. What's the reason for that?

Vectorization in GCC is enabled at <code>-O3</code>. That's why at <code>-O0</code>, you see only the ordinary scalar SSE2 instructions (<code>movsd</code>, <code>addsd</code>, etc). Using GCC 4.6.1 and your second example: <pre class="prettyprint"><code>#define N 10000 #define NTIMES 100000 double a[N] __attribute__ ((aligned (16))); double b[N] __attribute__ ((aligned (16))); double c[N] __attribute__ ((aligned (16))); double r[N] __attribute__ ((aligned (16))); int main (void) { int i, times; for (times = 0; times < NTIMES; times++) { for (i = 0; i < N; ++i) r[i] = (a[i] + b[i]) * c[i]; } return 0; } </code></pre> and compiling with <code>gcc -S -O3 -msse2 sse.c</code> produces for the inner loop the following instructions, which is pretty good: <pre class="prettyprint"><code>.L3: movapd a(%eax), %xmm0 addpd b(%eax), %xmm0 mulpd c(%eax), %xmm0 movapd %xmm0, r(%eax) addl $16, %eax cmpl $80000, %eax jne .L3 </code></pre> As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full the 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with <code>gcc -S -O3 -msse2 -mavx sse.c</code> gives for the inner loop: <pre class="prettyprint"><code>.L3: vmovapd a(%eax), %ymm0 vaddpd b(%eax), %ymm0, %ymm0 vmulpd c(%eax), %ymm0, %ymm0 vmovapd %ymm0, r(%eax) addl $32, %eax cmpl $80000, %eax jne .L3 </code></pre> Note that <code>v</code> in front of each instruction and that instructions use the 256-bit YMM registers, four iterations of the original loop are executed in parallel.

I would like to extend chill's answer and draw your attention on the fact that GCC seems not to be able to do the same smart use of the AVX instructions when iterating backwards. Just replace the inner loop in chill's sample code with: <pre class="prettyprint"><code>for (i = N-1; i >= 0; --i) r[i] = (a[i] + b[i]) * c[i]; </code></pre> GCC (4.8.4) with options <code>-S -O3 -mavx</code> produces: <pre class="prettyprint"><code>.L5: vmovsd a+79992(%rax), %xmm0 subq $8, %rax vaddsd b+80000(%rax), %xmm0, %xmm0 vmulsd c+80000(%rax), %xmm0, %xmm0 vmovsd %xmm0, r+80000(%rax) cmpq $-80000, %rax jne .L5 </code></pre>

GCC SSE code optimization

Tags:

c

optimization

compiler-optimization

sse

hpc

This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array and stores it in a forth array, all variables floating point double precision typed.

I made two versions of that code: one with SSE instructions, using calls to and another one without them I then compiled them with gcc and -O0 optimization level. I write them below:

// SSE VERSION

#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i <N; i+= 2){ 
        __m128d mm_a = _mm_load_pd( &a[i] );  
        _mm_prefetch( &a[i+4], _MM_HINT_T0 );
        __m128d mm_b = _mm_load_pd( &b[i] );  
        _mm_prefetch( &b[i+4] , _MM_HINT_T0 );
        __m128d mm_c = _mm_load_pd( &c[i] );
        _mm_prefetch( &c[i+4] , _MM_HINT_T0 );
        __m128d mm_r;
        mm_r = _mm_add_pd( mm_a, mm_b );
        mm_a = _mm_mul_pd( mm_r , mm_c );
        _mm_store_pd( &r[i], mm_a );
      }   
   }
 }

//NO SSE VERSION
//same definitions as before
int main(void){
  int i, times;
   for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i < N; i++ ){
      r[i] = (a[i]+b[i])*c[i];
    }   
  }
}

When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE intstructions, if not specifically given the -mno-sse (and others) options. I inspected the assembly code generated for the second code and I noticed that it makes use of movsd, addsd and mulsd instructions. So it makes use of SSE instructions but only of those that use the lowest part of the registers, if I am not wrong. The assembly code generated for the first C code made use, as expected, of the addp and mulpd instructions, though a pretty larger assembly code was generated.

Anyway, the first code should get better profit, as far as I know, of SIMD paradigm, since every iteration two result values are computed. Still that, the second code performs something such as a 25 per cent faster than the first one. I also made a test with single precision values and get similar results. What's the reason for that?

608

asked Oct 27 '11 16:10

Genís

2 Answers

Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }

  return 0;
}

and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne .L3

As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full the 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne .L3

Note that v in front of each instruction and that instructions use the 256-bit YMM registers, four iterations of the original loop are executed in parallel.

196

answered Oct 19 '22 18:10

chill

I would like to extend chill's answer and draw your attention on the fact that GCC seems not to be able to do the same smart use of the AVX instructions when iterating backwards.

Just replace the inner loop in chill's sample code with:

for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];

GCC (4.8.4) with options -S -O3 -mavx produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5

answered Oct 19 '22 19:10

Luca Citi

Related questions
                            
                                error C2733 second C linkage of overloaded function 'function' not allowed
                            
                                _mm_load_ps vs. _mm_load_pd vs. etc on Intel x86 ISA
                            
                                Library for gradient boosting tree
                            
                                Read and Write atomic operation implementation in the Linux Kernel
                            
                                Arithmetic vs boolean operations
                            
                                C: Passing variable number of arguments from one function to another
                            
                                How to do floating point calculations with integers
                            
                                What is a misaligned pointer ?
                            
                                Converting a non-`void` pointer to `uintptr_t` and vice-versa
                            
                                Why are compound literals in C modifiable
                            
                                On which platforms does integer divide by zero trigger a floating point exception?
                            
                                Sandboxing in Linux
                            
                                portable way to deal with 64/32 bit time_t
                            
                                Is it possible to build a interactive C shell?
                            
                                Would making plain int 64-bit break a lot of reasonable code?
                            
                                How do I get tcsetpgrp() to work in C?
                            
                                Getting Segmentation Fault
                            
                                How to write a hello world kernel?
                            
                                Is it good programming practice to use setjmp and longjmp in C?
                            
                                Valgrind complains with "Invalid write of size 8"

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With