I have been struggling to vectorize a particular application for some time now and have tried everything, from autovectorization to hand-coded SSE intrinsics. But somehow I am unable to obtain any speedup on my stencil-based application.
Following is a snippet of my current code, which I have vectorized using SSE intrinsics. When I compile it (Intel icc) with -vec-report3, I constantly get this message:
remark: loop was not vectorized: statement cannot be vectorized.
#pragma ivdep
for ( i = STENCIL; i < z - STENCIL; i+=4 )
{
it = it2 + i;
__m128 tmp2i = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k])),X4_i); //loop was not vectorized: statement cannot be vectorized
__m128 tmp3 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k])),X3_i);
__m128 tmp4 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k])),X2_i);
__m128 tmp5 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j +k*it_k])),X1_i);
__m128 tmp6 = _mm_add_ps(_mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5)), _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i));
_mm_store_ps(&tmp2[i],tmp6);
}
Am I missing something crucial? Since the message doesn't elaborate on why the loop cannot be vectorized, I am finding it difficult to ascertain the bottleneck.
UPDATE: After careful consideration of the suggestions, I tweaked the code in the following way. I thought it best to break it down further, to identify the statements that are actually responsible for the vector dependence.
//#pragma ivdep
for ( i = STENCIL; i < z - STENCIL; i+=4 )
{
it = it2 + i;
__m128 center = _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i);
u_j4 = _mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]); //Line 180
u_j3 = _mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]);
u_j2 = _mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]);
u_j1 = _mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]);
u_j8 = _mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k]);
u_j7 = _mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k]);
u_j6 = _mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k]);
u_j5 = _mm_load_ps(&p2[i+j*it_j+it_j +k*it_k]);
__m128 tmp2i = _mm_mul_ps(_mm_add_ps(u_j4,u_j8),X4_i);
__m128 tmp3 = _mm_mul_ps(_mm_add_ps(u_j3,u_j7),X3_i);
__m128 tmp4 = _mm_mul_ps(_mm_add_ps(u_j2,u_j6),X2_i);
__m128 tmp5 = _mm_mul_ps(_mm_add_ps(u_j1,u_j5),X1_i);
__m128 tmp6 = _mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5));
__m128 tmp7 = _mm_add_ps(tmp6,center);
_mm_store_ps(&tmp2[i],tmp7); //Line 196
}
When I compile (icc) the above code without #pragma ivdep, I get the following message:
remark: loop was not vectorized: existence of vector dependence.
vector dependence: assumed FLOW dependence between tmp2 line 196 and tmp2 line 196.
vector dependence: assumed ANTI dependence between tmp2 line 196 and tmp2 line 196.
When I compile (icc) it with #pragma ivdep, I get the following message:
remark: loop was not vectorized: unsupported data type. //Line 180
Why is there a dependence suggested for Line 196? How can I eliminate the suggested vector dependence?
Vectorized code refers to operations that are performed on multiple components of a vector at the same time (in one statement). Note that the addition (arithmetic operation) in the left code fragment is performed on all (multiple) components of the vectors a and b in one statement; the operands of ...
Loop vectorization transforms a program so that the same operation is performed at the same time on several vector elements.
"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
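To make the "4 elements, N/4 times" idea concrete, here is a minimal sketch of the same array addition written once as a scalar loop and once with SSE intrinsics. The function names add_scalar and add_sse are made up for this illustration, and the unaligned load/store variants are used only so that the sketch makes no assumption about how the arrays are aligned.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Scalar version: one float per iteration, n iterations. */
void add_scalar(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Hand-vectorized version: four floats per iteration, about n/4
 * iterations, followed by a scalar loop for the leftover elements. */
void add_sse(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load a[i..i+3] */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load b[i..i+3] */
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)                     /* remainder when n % 4 != 0 */
        out[i] = a[i] + b[i];
}
```

Both functions should produce the same results for the same inputs; the second one simply does the work in groups of four.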
Vectorization is a type of parallel processing. It enables more computer hardware to be devoted to performing the computation, so the computation is done faster.
The problem is that you're trying to use auto-vectorization together with hand-vectorized code. The compiler says that the line can't be vectorized because you can't vectorize a vector function.
Either let the compiler auto-vectorize it, or disable auto-vectorization and manually vectorize your code. As already noted in the comments, the auto-vectorizer also estimates vectorization profitability: it checks whether it is worth vectorizing your code.
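As a rough sketch of the first option, this is what a plain scalar form of the loop might look like, so that the auto-vectorizer has ordinary float arithmetic to work with instead of __m128 values. It rests on assumptions that the question does not confirm: that it2 equals j*it_j + k*it_k (called base here), that it_j2, it_j3 and it_j4 are 2, 3 and 4 times it_j, and that the X1_i..X4_i and C00_i vectors each hold a single coefficient broadcast to all lanes. The function name stencil_j and its parameter list are hypothetical.

```c
#include <stddef.h>

/* Scalar form of the j-direction stencil (hypothetical reconstruction).
 * `restrict` promises the compiler that out and p2 never overlap, which
 * is the kind of guarantee #pragma ivdep was being used to supply.
 * x[0..3] are assumed to correspond to the X1..X4 coefficients and
 * c0 to C00. */
void stencil_j(float *restrict out, const float *restrict p2,
               size_t begin, size_t end, size_t base, size_t it_j,
               float c0, const float x[4])
{
    for (size_t i = begin; i < end; ++i) {
        size_t c = base + i;            /* centre point, `it` in the question */
        float acc = c0 * p2[c];
        for (size_t r = 1; r <= 4; ++r) /* symmetric neighbours at +/- r rows */
            acc += x[r - 1] * (p2[c - r * it_j] + p2[c + r * it_j]);
        out[i] = acc;
    }
}
```

The caller would pass tmp2 as out, with begin = STENCIL and end = z - STENCIL. Note that the additions happen in a different order than in the intrinsics version, so the results may differ in the last bits. Whether icc actually vectorizes this loop still has to be checked in the vectorization report.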