Unable to detect why the following piece of code was not vectorized

I have been struggling to vectorize a particular application for some time now, and I have tried everything from auto-vectorization to hand-coded SSE intrinsics. Somehow I am still unable to obtain a speedup on my stencil-based application.

Following is a snippet of my current code, which I have vectorized using SSE intrinsics. When I compile it with Intel icc using -vec-report3, I consistently get this message:
remark: loop was not vectorized: statement cannot be vectorized.

  #pragma ivdep
  for ( i = STENCIL; i < z - STENCIL; i+=4 )
  {
    it = it2 + i;

    __m128 tmp2i = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k])),X4_i); //loop was not vectorized: statement cannot be vectorized
    __m128 tmp3 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k])),X3_i);
    __m128 tmp4 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k])),X2_i);
    __m128 tmp5 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j +k*it_k])),X1_i);

    __m128 tmp6 = _mm_add_ps(_mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5)), _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i));

    _mm_store_ps(&tmp2[i],tmp6);

   }

Am I missing something crucial? Since the message doesn't elaborate on why the statement cannot be vectorized, I am finding it difficult to pin down the problem.

UPDATE: After careful consideration of the suggestions, I tweaked the code as follows. I thought it best to break it down further, to identify the statements that are actually responsible for the vector dependence.

//#pragma ivdep
  for ( i = STENCIL; i < z - STENCIL; i+=4 )
  {
    it = it2 + i;
    __m128 center = _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i);

    u_j4 = _mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]); //Line 180
    u_j3 = _mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]);
    u_j2 = _mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]);
    u_j1 = _mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]);
    u_j8 = _mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k]);
    u_j7 = _mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k]);
    u_j6 = _mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k]);
    u_j5 = _mm_load_ps(&p2[i+j*it_j+it_j +k*it_k]);

    __m128 tmp2i = _mm_mul_ps(_mm_add_ps(u_j4,u_j8),X4_i);
    __m128 tmp3 = _mm_mul_ps(_mm_add_ps(u_j3,u_j7),X3_i);
    __m128 tmp4 = _mm_mul_ps(_mm_add_ps(u_j2,u_j6),X2_i);
    __m128 tmp5 = _mm_mul_ps(_mm_add_ps(u_j1,u_j5),X1_i);

    __m128 tmp6 = _mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5));
    __m128 tmp7 = _mm_add_ps(tmp6,center);

    _mm_store_ps(&tmp2[i],tmp7);  //Line 196

   }

When I compile (icc) the above code without #pragma ivdep I get the following message:

remark: loop was not vectorized: existence of vector dependence.
vector dependence: assumed FLOW dependence between tmp2 line 196 and tmp2 line 196.
vector dependence: assumed ANTI dependence between tmp2 line 196 and tmp2 line 196.

When I compile (icc) it with the #pragma ivdep, I get the following message:

remark: loop was not vectorized: unsupported data type. //Line 180

Why is there a dependence suggested for Line 196? How can I eliminate the suggested vector dependence?


asked Jul 16 '12 19:07

PGOnTheGo

1 Answer

The problem is that you're trying to use auto-vectorization together with hand-vectorized code. The compiler says that the statement can't be vectorized because you can't vectorize a vector function: the SSE intrinsics are already vector operations.
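To illustrate that an intrinsics loop is already vector code, here is a minimal sketch that assumes a simplified version of the stencil from the question. The names `stencil_sse`, `row`, `stride` and the coefficient array `c` are hypothetical and do not come from the original code. It also assumes an x86 target, and it uses the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` instead of the question's `_mm_load_ps`/`_mm_store_ps` so that it does not depend on 16-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Hand-vectorized symmetric stencil (hypothetical helper, not the asker's code).
 * row    : points at the first element of the centre row; neighbours are
 *          read at row[i - r*stride] and row[i + r*stride] for r = 1..4
 * out    : receives n results
 * n      : number of outputs, assumed to be a multiple of 4
 * stride : distance in elements between neighbouring rows
 * c0     : centre coefficient; c[0..3] : coefficients for offsets 1..4
 */
void stencil_sse(const float *row, float *out, ptrdiff_t n, ptrdiff_t stride,
                 float c0, const float c[4])
{
    assert(n % 4 == 0 && stride > 0);
    const __m128 vc0 = _mm_set1_ps(c0);

    /* Each iteration already computes 4 outputs at once, so there is
     * nothing left for an auto-vectorizer to do with this loop. */
    for (ptrdiff_t i = 0; i < n; i += 4) {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(&row[i]), vc0);
        for (int r = 1; r <= 4; ++r) {
            __m128 lo = _mm_loadu_ps(&row[i - r * stride]);
            __m128 hi = _mm_loadu_ps(&row[i + r * stride]);
            __m128 vc = _mm_set1_ps(c[r - 1]);
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_add_ps(lo, hi), vc));
        }
        _mm_storeu_ps(&out[i], acc);
    }
}
```

Under these assumptions, a "loop was not vectorized" remark for such a loop does not by itself mean the code runs as scalar code. The vector instructions come from the intrinsics, not from the auto-vectorizer.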

Either let the compiler auto-vectorize the loop, or disable auto-vectorization and vectorize your code manually. As already noted in the comments, the auto-vectorizer also estimates vectorization profitability: it checks whether vectorizing your code is worth it.
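The first option could look roughly like the following plain scalar loop. This is a sketch under stated assumptions, not the asker's actual code: `stencil_scalar`, `row`, `stride` and `c` are hypothetical names, and the stencil is reduced to a single axis. The `restrict` qualifiers promise that the input and output do not overlap, which is the kind of information a compiler needs before it can rule out an assumed dependence. Whether a given compiler then vectorizes the loop still depends on its flags and its profitability estimate.

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 4  /* hypothetical stencil radius: four neighbour pairs */

/* Scalar symmetric stencil written so that the compiler may auto-vectorize it.
 * row    : points at the first element of the centre row; neighbours are
 *          read at row[i - r*stride] and row[i + r*stride] for r = 1..RADIUS
 * out    : receives n results; must not overlap the input (restrict)
 * stride : distance in elements between neighbouring rows
 * c0     : centre coefficient; c[0..RADIUS-1] : coefficients for offsets 1..RADIUS
 */
void stencil_scalar(const float *restrict row, float *restrict out,
                    ptrdiff_t n, ptrdiff_t stride,
                    float c0, const float c[RADIUS])
{
    assert(n >= 0 && stride > 0);

    for (ptrdiff_t i = 0; i < n; ++i) {
        float acc = c0 * row[i];
        /* Fixed, small trip count: compilers typically unroll this. */
        for (int r = 1; r <= RADIUS; ++r) {
            acc += c[r - 1] * (row[i - r * stride] + row[i + r * stride]);
        }
        out[i] = acc;
    }
}
```

With this form there are no intrinsics in the loop body, so the vectorization report refers to ordinary scalar statements that the compiler is able to transform.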

answered Oct 27 '22 16:10

hdante


Tags: c, vectorization, icc, sse, stencils
