The introductory links I found while searching:
As you can see, most of them are for C, but I thought they might work for C++ as well. Here is my code:
#include <iostream>
#include <chrono>
#include <vector>

using std::vector;

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
         size_t end, const std::vector<T> &p2) {
    typename std::vector<T>::const_iterator it2 = p2.begin();
    //#pragma simd
    //#pragma omp parallel for
    //#pragma GCC ivdep Unroll Vector
    for (size_t i = start; i < end; ++i, ++it2) {
        p1[i] = p1[i] - *it2;
        p1[i] += 1;
    }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    vector<double> v, u;
    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    foo(v, 0, n, u);

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}
I used all the hints you can see commented above, but I did not get any speedup, as a sample output shows (the first run had #pragma GCC ivdep Unroll Vector uncommented):
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.
Is there any hope, or does the optimization flag -O3 just do the trick? Any suggestions to speed up this code (the foo function) are welcome!
My version of g++:
samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
Notice that the body of the loop is arbitrary; I am not interested in re-writing it in some other form.
EDIT
An answer saying that there is nothing more that can be done is also acceptable!
There are two ways to vectorize a loop computation in a C/C++ program. Programmers can use intrinsics inside the C/C++ source code to tell compilers to generate specific SIMD instructions so as to vectorize the loop computation. Or, compilers may be set up to vectorize the loop computation automatically.
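To give a rough idea of the intrinsics route, here is a minimal sketch of the same loop body written with SSE2 intrinsics. The function name foo_sse2 is my own, and it assumes double elements, an even element count, and an x86 target; it is only an illustration of what manual vectorization looks like, not the asker's code:

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Processes two doubles per iteration: p1[i] = p1[i] - p2[i] + 1.
// Assumes (end - start) is even; a scalar tail loop would handle any remainder.
void foo_sse2(double *p1, const double *p2, std::size_t start, std::size_t end)
{
    const __m128d ones = _mm_set1_pd(1.0);
    for (std::size_t i = start; i + 1 < end; i += 2) {
        __m128d a = _mm_loadu_pd(&p1[i]);        // load p1[i], p1[i+1]
        __m128d b = _mm_loadu_pd(&p2[i]);        // load p2[i], p2[i+1]
        a = _mm_add_pd(_mm_sub_pd(a, b), ones);  // (a - b) + 1, element-wise
        _mm_storeu_pd(&p1[i], a);                // store back to p1
    }
}

The point of the automatic approach discussed below is that the compiler can produce essentially this kind of code for you.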
Loop vectorization transforms procedural loops by assigning a processing unit to each pair of operands. Programs spend most of their time within such loops. Therefore, vectorization can significantly accelerate them, especially over large data sets.
The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips.
Vectorization is a programming technique that uses vector operations instead of element-by-element loop-based operations. Besides frequently producing more succinct code, vectorization also allows for better optimization in the subsequent implementation.
The -O3 flag turns on -ftree-vectorize automatically (see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html):
-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options
So in both cases the compiler is trying to do loop vectorization.
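As a side note, if you want to confirm which vectorization options your particular GCC enables at a given -O level, you can query the compiler directly (the exact option names listed vary by GCC version):

g++ -Q --help=optimizers -O3 | grep vectorize

This prints each optimizer option together with an [enabled] or [disabled] marker.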
Using g++ 4.8.2 to compile with:
# In newer versions of GCC use -fopt-info-vec / -fopt-info-vec-missed instead of -ftree-vectorizer-verbose
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test
Gives this:
Analyzing loop at test.cpp:16
Vectorizing loop at test.cpp:16
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.
test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29
test.cpp:22: note: vectorized 1 loops in function.
test.cpp:18: note: Unroll loop 7 times
test.cpp:16: note: Unroll loop 7 times
test.cpp:28: note: Unroll loop 1 times
Compiling without the -ftree-vectorize flag:
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test
Returns only this:
test_old.cpp:16: note: Unroll loop 7 times
test_old.cpp:28: note: Unroll loop 1 times
Line 16 is the start of the loop function, so the compiler is definitely vectorizing it. Checking the assembler confirms this too.
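If you want to inspect the assembler yourself, one way (assuming an x86-64 target) is to emit the assembly and look for packed double-precision instructions such as subpd/addpd, which only show up when the loop has actually been vectorized:

g++ test.cpp -O2 -std=c++0x -ftree-vectorize -S -o test.s
grep -E 'subpd|addpd' test.s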
I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to accurately measure how long the function takes to run.
But here are a couple of other things you can try too:
Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable).
Here's my resulting code (I removed the template since you will want to use different alignment for different data types):
#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
    double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
    double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

    for (size_t i = start; i < end; ++i)
    {
        pA1[i] = pA1[i] - pA2[i];
        pA1[i] += 1;
    }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v, u;
    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    foo(&v[0], &u[0], 0, n);

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}
Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
GCC has compiler extensions that create new primitives that will use SIMD instructions. Take a look here for details.
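For reference, here is a minimal sketch of what those GCC vector extensions look like for this loop. The function name foo_vec and the choice of a 16-byte (two-double, SSE-register-sized) vector are my assumptions, not anything from the question:

#include <cstddef>

// GCC vector extension: a vector of 2 doubles (16 bytes, fits one SSE register).
typedef double v2df __attribute__((vector_size(16)));

// Same loop body as foo(), two elements at a time; assumes (end - start) is even.
void foo_vec(double *p1, const double *p2, std::size_t start, std::size_t end)
{
    const v2df ones = {1.0, 1.0};
    for (std::size_t i = start; i + 1 < end; i += 2) {
        v2df a = {p1[i], p1[i + 1]};
        v2df b = {p2[i], p2[i + 1]};
        a = a - b + ones;        // element-wise arithmetic on the vector type
        p1[i]     = a[0];        // subscripting vector types needs a reasonably recent GCC
        p1[i + 1] = a[1];
    }
}

The appeal of this style over raw intrinsics is that the arithmetic stays ordinary C++ operators while GCC still maps it onto SIMD instructions.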
Most compilers say they will auto-vectorize operations, but this depends on the compiler's pattern matching, which as you can imagine can be very hit and miss.