GCC auto-vectorization has no effect on runtime, even when supposedly "profitable"

Q: Is it possible to generate vectorized code with GCC-O0?

With -O0, however, vectorized code won't be generated, even for very simple examples. I suspect that gcc's tree vectorizer isn't even called with -O0, or called and bails out, but that has to be verified in the gcc source code.

Q: How does auto vectorization work in a compiler?

In compilers, optimizations happen in phases, where each optimization phase prepares the ground for the next one. For auto vectorization to occur, at least on non trivial examples, the compiler has to perform some optimizations beforehand.

Q: Is GCC's tree vectorizer called with-O0?

I suspect that gcc's tree vectorizer isn't even called with -O0, or called and bails out, but that has to be verified in the gcc source code. Generally, -O0 and auto vectorization don't mix very well. In compilers, optimizations happen in phases, where each optimization phase prepares the ground for the next one.

Q: Is it possible to vectorize a loop?

For example, loops that contain jumps usually cannot be vectorized, unless branches are eliminated and replaced with predicated instructions by an optimization called if-conversion - resulting in a flat block of code, which could be vectorized more conviniently.

Tags:

c

gcc

auto-vectorization

I've spent the past few days reading about autovectorization with gcc 4.7. I followed some examples I saw online, and the setup seems to be correct. But when I actually run with the code and compare between vectorization on or off, there isn't a noticeable difference in runtime.

Here's the code I've been working with:

#include <string.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv) {

    long b = strtol(argv[2], NULL, 0); 
    unsigned long long int i;
    unsigned long long int n = (int)pow(2,29);                                                                                                                                                                                            
    float total = 0;

    float *__restrict__ x1; 
    float *__restrict__ y1; 

    posix_memalign((void *)&x1, 16, sizeof(float)*n);
    posix_memalign((void *)&y1, 16, sizeof(float)*n);


    float *__restrict__ x = __builtin_assume_aligned(x1,16);
    float *__restrict__ y = __builtin_assume_aligned(y1,16);

    for (i=0;i<n;i++) {
            x[i] = i;
            y[i] = i;
    }   

    for (i=0; i<n; i++) {
            y[i] += x[i];
    }   

    printf("y[%li]: \t\t\t\t%f\n",  b,y[b]);
    printf("correct answer: \t\t\t%f\n", (b)*2);
    return 0;
}

Some of this stuff seems redundant to me, but was necessary to get the compiler to understand what was going on (especially the fact that the data were aligned). The "b" variable that's read from command line is just there because I was paranoid about the compiler optimizing away the loop entirely.

Here is the compiler command when vectorizeration is enabled:

gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -ftree-vectorize -o v

Basically, this is equivalent to just using -O3. I put the flags in myself so that all I needed to do was remove "ftree-vectorize" and be able to test the result sans vectorization.

Here is the output of the ftree-vectorize-verbose flag, to show that the code is in fact being vectorized:

Analyzing loop at test.c:29

29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_store_cost: aligned.
29: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
29: cost model: Adding cost of checks for loop versioning aliasing.

29: Cost model analysis: 
  Vector inside of loop cost: 4
  Vector outside of loop cost: 4
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 2

29:   Profitability threshold = 3


Vectorizing loop at test.c:29

29: Profitability threshold is 3 loop iterations.
29: created 1 versioning for alias checks.

29: LOOP VECTORIZED.
Analyzing loop at test.c:24

24: vect_model_induction_cost: inside_cost = 2, outside_cost = 2 .
24: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
24: not vectorized: relevant stmt not supported: D.5806_18 = (float) D.5823_58;

test.c:7: note: vectorized 1 loops in function.

Note that the vectorization is profitable after 3 iterations, and I'm running with 2^29~=500,000,000 iterations. So I should expect a vastly different runtime with vectorization turned off, right?

Well, here are the runtimes of the code (I ran it 20 times in a row):

Throwing away that weird ~370s outlier, that gives a mean runtime of 58.7s, with a standard deviation of 6.0s.

Next, I'll compile with the same command as before, but with no -ftree-vectorize flag:

gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -o nov

Again running the program 20 times in a row yields the following times:

Again throwing away outliers, this gives a mean runtimee of 57.9s with a standard deviation of 3.6s.

So these two versions have statistically indistinguishable runtimes.

Can anyone point me to what I'm doing wrong? Does the "profitability threshold" spit out by the compiler not mean what I think it means? I really appreciate any help people can give me, I've been trying to figure this out for the past week.

EDIT:

I implemented the change that @nilspipenbrinck suggested, and it seems to have worked. I stuck the vectorized loop in a function, and called that function a boatload of times. The relative run-times are now 24.0s (sigma of <0.1s) for no vectorization vs 20.8s (sigma of <0.2s) for vectorization, or a 13% speed improvement. Not as much as I was hoping for, but at least now I know its working! Thanks for taking the time to look at my question and write an answer, I really appreciate it.

780

asked Nov 15 '14 12:11

user2635263

2 Answers

You don't do much arithmetic. Therefore the runtime of your test code is memory bound. E.g. you spend most of the time by moving the data between the CPU and memory.

Furthermore your n is very large with 2^29 elements. Therefore you don't benefit from the first and second level cache in any way.

If you want to see improvements with SSE, use a smaller n such that you only touch 8 or 16 kilobyte of data. Also make sure that the data is 'hot' e.g. it has recently been accessed by the CPU. That way the data does not have to be moved from main memory but it gets moved from the caches which is several magnitudes faster.

As an alternative you could also do a lot more arithmetic. This would give the memory prefetch system a chance to fetch the data from main memory in the background while you utilize the CPU doing math.

Summarized: If the arithmetic is faster than your system can move the memory around you will not see any benefits. Memory access times will be the bottleneck and the few cycles you save using the SSE instruction set will get lost in the noise of memory access timings.

answered Nov 14 '22 04:11

Nils Pipenbrinck

There are several factors that determine how profitable will be to vectorize code. In this case (basing this on the output you provided) the compiler is only vectorizing one loop, I would think that's the second one because the first one would be typically ignored since there is not enough computation being done for it to be profitable to vectorize.

The running times you are posting are for the whole code, not for the loop alone, so there is only so much vectorizing will do for the overall running time. If you really want to see how much improvement is there from vectorization I would suggest running a profiler such as AMD Code XL, Intel Vtune, OProfile and such, it will tell you specifically for that loop how much improvement in terms of time and performance you are making.

Right now I'm working in evaluation of vectorizing compilers, and I would code running upt to 60 times faster with vectorization, other times the speedup is not as impressive, and it all depends on the loop, the compiler and the architecture you are using.

answered Nov 14 '22 03:11

David

Related questions
                            
                                Get requested address in socket programming with C
                            
                                What's the difference between `extern int (x)[]` and `extern int x[]` in C?
                            
                                XS Memory leak in this code?
                            
                                If I define an array in if statement then does memory get allocated?
                            
                                Header Files in C
                            
                                Union and endianness
                            
                                Difference between raise(SIGABRT) and abort() methods
                            
                                Adding Zero padding to an array
                            
                                Positioning Devices (Intersecting Circles)
                            
                                When exactly is gcc __attribute__((constructor)) run?
                            
                                Why can I read from STDOUT and get a user's terminal input?
                            
                                Left shift of negative values by 0 positions?
                            
                                Create dynamic library (.so file) from Python code
                            
                                C11 and C++11 problems in Sublime Text 3
                            
                                Can an __attribute__ specifier be used with both the function prototype and the function definition?
                            
                                Implementation of generalized hypergeometric function pFq [closed]
                            
                                What class entries should I use to throw non-default exceptions from a C extension?
                            
                                In ARM Linux, what is the purpose of the few bytes reserved at the "bottom" of kernel stack for each thread
                            
                                Extract public key from EVP_PKEY keypair?
                            
                                Is reading a "zero" from memory faster than reading other values?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With