Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SSE SIMD Optimization For Loop

I have some code in a loop

for(int i = 0; i < n; i++)
{
  u[i] = c * u[i] + s * b[i];
}

So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?

UPDATE

I learnt vectorization (turns out it's not so hard if you use intrinsics) and implemented my loop in SSE. However, when setting the SSE2 flag in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler on the other hand was much faster than my SSE code or the VC++ compiler.

Here is the code I wrote for reference

double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for(int i = 0; i < n; i++)
{
   u[i] = 0;
}

int j = 0;
__m128d *uSSE = (__m128d*) u;
__m128d cStore = _mm_set1_pd(c);
__m128d sStore = _mm_set1_pd(s);
for (j = 0; j <= i - 2; j+=2)
{
  __m128d uStore = _mm_set_pd(u[j+1], u[j]);

  __m128d cu = _mm_mul_pd(cStore, uStore);
  __m128d so = _mm_mul_pd(sStore, omegaStore);

  uSSE[j/2] = _mm_add_pd(cu, so);
}
for(; j <= i; ++j)
{
  u[j] = c * u[j] + s * omegaCache[j];
}
like image 932
Projectile Fish Avatar asked May 27 '10 03:05

Projectile Fish


People also ask

What is SIMD optimization?

NumPy comes with a flexible working mechanism that allows it to harness the SIMD features that CPUs own, in order to provide faster and more stable performance on all popular platforms. Currently, NumPy supports the X86, IBM/Power, ARM7 and ARM8 architectures.

Does GCC use SIMD?

The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.

Does C++ use SIMD?

One approach to leverage vector hardware are SIMD intrinsics, available in all modern C or C++ compilers. SIMD stands for “single Instruction, multiple data”. SIMD instructions are available on many platforms, there's a high chance your smartphone has it too, through the architecture extension ARM NEON.


1 Answers

Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:

int i;
for(i = 0; i < n - 3; i += 4)
{
  load elements u[i,i+1,i+2,i+3]
  load elements b[i,i+1,i+2,i+3]
  vector multiply u * c
  vector multiply s * b
  add partial results
  store back to u[i,i+1,i+2,i+3]
}

// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for( ; i < n; i++)
  u[i] = c * u[i] + s * b[i];

For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.

like image 153
Adam Rosenfield Avatar answered Sep 28 '22 02:09

Adam Rosenfield