Using AVX intrinsics instead of SSE does not improve speed -- why?

Question

I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed-up my programs. This, unfortunately, was not the case until now. Probably I am doing a stupid mistake, so I would be very grateful if somebody could help me out.

I use Ubuntu 11.10 with g++ 4.6.1. I compiled my program (see below) with

g++ simpleExample.cpp -O3 -march=native -o simpleExample

The test system has a Intel i7-2600 CPU.

Here is the code which exemplifies my problem. On my system, I get the output

98.715 ms, b[42] = 0.900038 // Naive
24.457 ms, b[42] = 0.900038 // SSE
24.646 ms, b[42] = 0.900038 // AVX

Note that the computation sqrt(sqrt(sqrt(x))) was only chosen to ensure that memory bandwith does not limit execution speed; it is just an example.

simpleExample.cpp:

#include <immintrin.h>
#include <iostream>
#include <math.h> 
#include <sys/time.h>

using namespace std;

// -----------------------------------------------------------------------------
// This function returns the current time, expressed as seconds since the Epoch
// -----------------------------------------------------------------------------
double getCurrentTime(){
  struct timeval curr;
  struct timezone tz;
  gettimeofday(&curr, &tz);
  double tmp = static_cast<double>(curr.tv_sec) * static_cast<double>(1000000)
             + static_cast<double>(curr.tv_usec);
  return tmp*1e-6;
}

// -----------------------------------------------------------------------------
// Main routine
// -----------------------------------------------------------------------------
int main() {

  srand48(0);            // seed PRNG
  double e,s;            // timestamp variables
  float *a, *b;          // data pointers
  float *pA,*pB;         // work pointer
  __m128 rA,rB;          // variables for SSE
  __m256 rA_AVX, rB_AVX; // variables for AVX

  // define vector size 
  const int vector_size = 10000000;

  // allocate memory 
  a = (float*) _mm_malloc (vector_size*sizeof(float),32);
  b = (float*) _mm_malloc (vector_size*sizeof(float),32);

  // initialize vectors //
  for(int i=0;i<vector_size;i++) {
    a[i]=fabs(drand48());
    b[i]=0.0f;
  }

// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// Naive implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  s = getCurrentTime();
  for (int i=0; i<vector_size; i++){
    b[i] = sqrtf(sqrtf(sqrtf(a[i])));
  }
  e = getCurrentTime();
  cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;

// -----------------------------------------------------------------------------
  for(int i=0;i<vector_size;i++) {
    b[i]=0.0f;
  }
// -----------------------------------------------------------------------------

// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// SSE2 implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  pA = a; pB = b;

  s = getCurrentTime();
  for (int i=0; i<vector_size; i+=4){
    rA   = _mm_load_ps(pA);
    rB   = _mm_sqrt_ps(_mm_sqrt_ps(_mm_sqrt_ps(rA)));
    _mm_store_ps(pB,rB);
    pA += 4;
    pB += 4;
  }
  e = getCurrentTime();
  cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;

// -----------------------------------------------------------------------------
  for(int i=0;i<vector_size;i++) {
    b[i]=0.0f;
  }
// -----------------------------------------------------------------------------

// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// AVX implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  pA = a; pB = b;

  s = getCurrentTime();
  for (int i=0; i<vector_size; i+=8){
    rA_AVX   = _mm256_load_ps(pA);
    rB_AVX   = _mm256_sqrt_ps(_mm256_sqrt_ps(_mm256_sqrt_ps(rA_AVX)));
    _mm256_store_ps(pB,rB_AVX);
    pA += 8;
    pB += 8;
  }
  e = getCurrentTime();
  cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;

  _mm_free(a);
  _mm_free(b);

  return 0;
}

Any help is appreciated!

Norbert P. · Accepted Answer

This is because VSQRTPS (AVX instruction) takes exactly twice as many cycles as SQRTPS (SSE instruction) on a Sandy Bridge processor. See Agner Fog's optimize guide: instruction tables, page 88.

Instructions like square root and division don't benefit from AVX. On the other hand, additions, multiplications, etc., do.

Evgeny Kluev · Answer

If you are interested in increasing square root performance, instead of VSQRTPS you can use VRSQRTPS and Newton-Raphson formula:

x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)

VRSQRTPS itself doesn't benefit from AVX, but other calculations do.

Use it if 23 bits of precision is enough for you.

Salah Saleh · Answer

Just for completeness. The Newton-Raphson (NR) implementation for operations like the division or the square root will only be beneficial if you have a limited number of those operations in your code. This is because if you used these alternative methods you will generate more pressure on other ports such as the multiplication and addition ports. That's basically the reason why x86 architectures have special hardware unit to handle these operation instead of the alternative software solutions (like NR). I quote from Intel 64 and IA-32 Architectures Optimization Reference Manual p.556:

"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."

So be careful when using NR in large algorithms. Actually, I had my master's thesis around this point and I will leave a link to it here for future reference, once it is published .

Also for people how always wonder about the throughput and the latency of certain instructions, have a look on IACA. It is a very useful tool provided by Intel to statically analyze the in-core execution performance of codes.

edited here is a link to the thesis for those who are interested thesis

SoapBox · Answer

Depending on your processor hardware, the AVX instructions may be emulated in the hardware as SSE instructions. You'd need to look up your processor's part number to get exact specs on it, but this is one of the main differences between low-end and high-end intel processors, the number of specialize execution units vs. hardware emulation.

Using AVX intrinsics instead of SSE does not improve speed -- why?

Tags:

c++

performance

gcc

avx

sse

user1158218

4 Answers

Norbert P.

Evgeny Kluev

Salah Saleh

SoapBox

Recent Activity

Donate For Us

Using AVX intrinsics instead of SSE does not improve speed -- why?

Tags:

c++

performance

gcc

avx

sse

user1158218

4 Answers

Norbert P.

Evgeny Kluev

Salah Saleh

SoapBox

Related questions

Recent Activity

Donate For Us