Auto vectorization not working

I'm trying to get my code to auto vectorize, but it isn't working.

int _tmain(int argc, _TCHAR* argv[])
{
    const int N = 4096;
    float x[N];
    float y[N];
    float sum = 0;

    //create random values for x and y 
    for (int i = 0; i < N; i++)
    {
        x[i] = rand() >> 1;
        y[i] = rand() >> 1;
    }

    for (int i = 0; i < N; i++){
        sum += x[i] * y[i];
    }
}

Neither loop vectorizes here, but I'm really only interested in the second loop.

I'm using Visual Studio Express 2013 and am compiling with the /O2 and /Qvec-report:2 options (the latter reports whether or not each loop was vectorized). When I compile, I get the following message:

--- Analyzing function: main
c:\users\...\documents\visual studio 2013\projects\intrin3\intrin3\intrin3.cpp(28) : info C5002: loop not vectorized due to reason '1200'
c:\users\...\documents\visual studio 2013\projects\intrin3\intrin3\intrin3.cpp(41) : info C5002: loop not vectorized due to reason '1305'

Reason '1305', according to the vectorizer messages documentation, means that "the compiler can't discern proper vectorizable type information for this loop." I'm not really sure what this means. Any ideas?

After splitting the second loop into two loops:

for (int i = 0; i < N; i++){
    sumarray[i] = x[i] * y[i];
}

for (int i = 0; i < N; i++){
    sum += sumarray[i];
}

Now the first of the above loops vectorizes, but the second one does not, again with reason code 1305.

Jon B. Jones asked Apr 30 '14 02:04


2 Answers

Reason 1305 is reported because the value of sum is never used, so the optimizer has no reason to vectorize the loop. Simply adding printf("%f\n", sum) fixes that. But then you get a new reason code, 1105: "Loop includes a non-recognized reduction operation". To fix this you need to set /fp:fast.

The reason is that floating-point arithmetic is not associative, and a reduction using SIMD or MIMD (i.e. multiple threads) changes the order of the additions, which is only valid if the operation is associative. With a looser floating-point model the compiler is allowed to do the reduction.
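To make the associativity point concrete, here is a minimal sketch in C++. The function names sum_sequential and sum_four_lanes are made up for illustration, and the second function only imitates by hand the order of additions a 4-wide SIMD reduction would roughly use; it is not what the compiler literally emits.

```cpp
#include <cstddef>

// Strict left-to-right sum: the order of additions that /fp:precise
// obliges the compiler to preserve.
float sum_sequential(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Four independent partial sums, combined at the end. This mimics the
// order used by a 4-wide SIMD reduction.
// Assumption: n is a multiple of 4.
float sum_four_lanes(const float* v, std::size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) lane[k] += v[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, on a typical x86-64 build with strict single-precision evaluation, sum_sequential returns 1 while sum_four_lanes returns 0, because 1e8f + 1.0f rounds back to 1e8f. The compiler cannot know that such a difference is acceptable, which is why it wants /fp:fast before reordering the sum.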

I just tested this with the following code: with the default /fp:precise the loop does not vectorize, and with /fp:fast it does.

#include <stdio.h>
int main() {
    const int N = 4096;
    float x[N];
    float y[N];
    float sum = 0;
    for (int i = 0; i < N; i++){
        sum += x[i] * y[i];
    }
    printf("sum %f\n", sum);
}

As for your question about the loop with rand(): rand() is not a SIMD function, so a loop that calls it can't be vectorized. You would need to find a SIMD rand() function; I don't know of one. An alternative is to precompute an array of random numbers and use the array instead. In any case, rand() is a horrible random number generator and is only useful for toy cases. Consider using the Mersenne Twister PRNG.
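A rough sketch of the precompute idea, using the standard library's std::mt19937 (a Mersenne Twister) from <random>. The helper names make_random and dot, the seed parameter, and the [0, 1) range are assumptions made for this example, not part of the original code.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fill a buffer with pseudo-random floats using the Mersenne Twister.
// This loop stays scalar; the point is to keep the generator call out
// of the loop you actually want vectorized.
std::vector<float> make_random(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> v(n);
    for (float& e : v) e = dist(gen);
    return v;
}

// Plain reduction loop with no function calls in the body, which is the
// shape the auto-vectorizer can work with (given /fp:fast for the sum).
float dot(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```

Usage would be along the lines of creating x and y with make_random(4096, some_seed) and then calling dot(x.data(), y.data(), x.size()). Whether the dot loop actually vectorizes still depends on the compiler and its flags.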

Z boson answered Sep 28 '22 23:09


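One problem could be that your compiler doesn't necessarily align arrays allocated on the stack. If your compiler supports C++11 alignas you could use:

alignas(16) float x[N];
alignas(16) float y[N];

This explicitly requests 16-byte-aligned memory, which the aligned SSE load and store instructions require. Note that Visual Studio 2013 does not implement alignas; the MSVC-specific equivalent is __declspec(align(16)) in front of the declaration.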


EDIT:

Even if alignment isn't the issue and your compiler is vectorizing unaligned code, it is still worth aligning the data, since unaligned SSE accesses can be noticeably slower than their aligned counterparts, especially on older processors.
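If you want to confirm that the alignment request took effect, something like the following sketch can help. It assumes a compiler with C++11 alignas support; is_aligned and AlignedBuffers are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <cstdint>

// True if p is aligned to `alignment` bytes (alignment must be nonzero).
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical container for the two arrays from the question, each
// requested on a 16-byte boundary.
struct AlignedBuffers {
    alignas(16) float x[4096];
    alignas(16) float y[4096];
};
```

Declaring an AlignedBuffers object and checking is_aligned(buf.x, 16) and is_aligned(buf.y, 16) should both give true. This only verifies the addresses; it says nothing about whether the compiler then chooses aligned instructions.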

RamblingMad answered Sep 29 '22 01:09