I have some code in a loop:
for(int i = 0; i < n; i++)
{
    u[i] = c * u[i] + s * b[i];
}
So, u and b are vectors of the same length, and c and s are scalars. Is this loop a good candidate for SSE vectorization, in order to get a speedup?
UPDATE
I learnt vectorization (turns out it's not so hard if you use intrinsics) and implemented my loop in SSE. However, when setting the SSE2 flag in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler, on the other hand, was much faster than both my SSE code and the VC++ compiler.
Here is the code I wrote, for reference:
#include <emmintrin.h>   // SSE2 double-precision intrinsics
#include <malloc.h>      // _aligned_malloc

double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for(int i = 0; i < n; i++)
{
    u[i] = 0;
}

// omegaCache holds the b values from the loop above
int j = 0;
__m128d *uSSE = (__m128d*) u;
__m128d cStore = _mm_set1_pd(c);    // broadcast c into both lanes
__m128d sStore = _mm_set1_pd(s);    // broadcast s into both lanes
for (j = 0; j <= n - 2; j += 2)     // two doubles per iteration
{
    __m128d uStore     = _mm_set_pd(u[j+1], u[j]);
    __m128d omegaStore = _mm_set_pd(omegaCache[j+1], omegaCache[j]);
    __m128d cu = _mm_mul_pd(cStore, uStore);
    __m128d so = _mm_mul_pd(sStore, omegaStore);
    uSSE[j/2] = _mm_add_pd(cu, so);
}
for(; j < n; ++j)                   // scalar cleanup for an odd trailing element
{
    u[j] = c * u[j] + s * omegaCache[j];
}
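As an aside on the loads: since u comes from _aligned_malloc with 16-byte alignment, the element-by-element _mm_set_pd can be replaced with a single vector load per pair. A small sketch of that variant; I'm assuming nothing about omegaCache's alignment, hence the unaligned _mm_loadu_pd for it:

for (j = 0; j <= n - 2; j += 2)
{
    __m128d uStore     = _mm_load_pd(&u[j]);           // aligned load: u came from _aligned_malloc(..., 16)
    __m128d omegaStore = _mm_loadu_pd(&omegaCache[j]); // unaligned load, alignment of omegaCache is unknown
    uSSE[j/2] = _mm_add_pd(_mm_mul_pd(cStore, uStore),
                           _mm_mul_pd(sStore, omegaStore));
}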
NumPy, for example, comes with a flexible mechanism that allows it to harness the SIMD features of the CPU in order to provide faster and more stable performance on all popular platforms. Currently, NumPy supports the x86, IBM/Power, ARMv7 and ARMv8 architectures.
The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
One approach to leveraging vector hardware is SIMD intrinsics, available in all modern C or C++ compilers. SIMD stands for "single instruction, multiple data". SIMD instructions are available on many platforms; there's a good chance your smartphone has them too, through the ARM NEON architecture extension.
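Another of those ways is gcc's vector extensions, which let you express the same update without writing intrinsics by hand. A minimal sketch, assuming gcc or clang; the typedef v2df and the function name update are placeholders of mine, not part of the question:

/* GNU C vector extension: a 16-byte vector holding two doubles */
typedef double v2df __attribute__ ((vector_size (16)));

void update(double *u, const double *b, double c, double s, int n)
{
    v2df cv = {c, c};   /* broadcast the scalars into both lanes */
    v2df sv = {s, s};
    int i = 0;
    for (; i + 2 <= n; i += 2)
    {
        v2df uv, bv;
        __builtin_memcpy(&uv, &u[i], sizeof uv);   /* unaligned-safe loads */
        __builtin_memcpy(&bv, &b[i], sizeof bv);
        uv = cv * uv + sv * bv;                    /* element-wise vector arithmetic */
        __builtin_memcpy(&u[i], &uv, sizeof uv);
    }
    for (; i < n; i++)                             /* scalar tail */
        u[i] = c * u[i] + s * b[i];
}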
Yes, this is an excellent candidate for vectorization. But before you do, make sure you've profiled your code to confirm that this is actually worth optimizing. That said, the vectorization would go something like this:
int i;
for(i = 0; i < n - 3; i += 4)
{
    load elements u[i,i+1,i+2,i+3]
    load elements b[i,i+1,i+2,i+3]
    vector multiply u * c
    vector multiply s * b
    add partial results
    store back to u[i,i+1,i+2,i+3]
}
// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for( ; i < n; i++)
    u[i] = c * u[i] + s * b[i];
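For concreteness, here is a minimal sketch of the same loop using actual SSE intrinsics. It assumes u and b are arrays of float (the four-at-a-time stride above implies single precision); for double you would use __m128d, a stride of 2, and the _pd variants of the intrinsics:

#include <xmmintrin.h>   /* SSE intrinsics for single precision */

int i;
__m128 cv = _mm_set1_ps(c);                   /* broadcast c into all four lanes */
__m128 sv = _mm_set1_ps(s);                   /* broadcast s into all four lanes */
for (i = 0; i + 4 <= n; i += 4)
{
    __m128 uv = _mm_loadu_ps(&u[i]);          /* load u[i..i+3] */
    __m128 bv = _mm_loadu_ps(&b[i]);          /* load b[i..i+3] */
    __m128 cu = _mm_mul_ps(cv, uv);           /* c * u */
    __m128 sb = _mm_mul_ps(sv, bv);           /* s * b */
    _mm_storeu_ps(&u[i], _mm_add_ps(cu, sb)); /* store back to u[i..i+3] */
}
for ( ; i < n; i++)                           /* scalar remainder */
    u[i] = c * u[i] + s * b[i];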
For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one iteration with the memory accesses from a different iteration.
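As a rough illustration of the prefetch idea, continuing the float sketch above: the lookahead distance of 64 elements is an arbitrary assumption you would tune, and prefetching an address past the end of the array does not fault, so no extra bounds check is needed.

for (i = 0; i + 4 <= n; i += 4)
{
    /* start pulling in data we will need roughly 16 iterations from now */
    _mm_prefetch((const char*)&u[i + 64], _MM_HINT_T0);
    _mm_prefetch((const char*)&b[i + 64], _MM_HINT_T0);

    __m128 uv = _mm_loadu_ps(&u[i]);
    __m128 bv = _mm_loadu_ps(&b[i]);
    _mm_storeu_ps(&u[i], _mm_add_ps(_mm_mul_ps(cv, uv), _mm_mul_ps(sv, bv)));
}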