The introductory links I found while searching:
As you can see, most of them are for C, but I thought they might work for C++ as well. Here is my code:
#include <iostream>
#include <chrono>
#include <vector>

using std::vector;

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
         size_t end, const std::vector<T> &p2) {
    typename std::vector<T>::const_iterator it2 = p2.begin();
    //#pragma simd
    //#pragma omp parallel for
    //#pragma GCC ivdep Unroll Vector
    for (size_t i = start; i < end; ++i, ++it2) {
        p1[i] = p1[i] - *it2;
        p1[i] += 1;
    }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    vector<double> v, u;
    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    foo(v, 0, n, u);

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}
I used all the hints you can see commented above, but I did not get any speedup, as a sample output shows (the first run had #pragma GCC ivdep Unroll Vector uncommented):
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.
Is there any hope, or does the optimization flag -O3 just do the trick? Any suggestions to speed up this code (the foo function) are welcome!
My version of g++:
samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
Notice that the body of the loop is arbitrary; I am not interested in re-writing it in some other form.
EDIT
An answer saying that there is nothing more that can be done is also acceptable!
There are two ways to vectorize a loop computation in a C/C++ program. Programmers can use intrinsics inside the C/C++ source code to tell compilers to generate specific SIMD instructions so as to vectorize the loop computation. Or, compilers may be set up to vectorize the loop computation automatically.
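To give a rough idea of the intrinsics route, here is a minimal sketch of the same loop body written with SSE2 intrinsics. The function name foo_sse2 is my own, and it assumes double elements, an even element count, and an x86 target; it is only an illustration of what manual vectorization looks like, not the asker's code:

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Processes two doubles per iteration: p1[i] = p1[i] - p2[i] + 1.
// Assumes (end - start) is even; a scalar tail loop would handle any remainder.
void foo_sse2(double *p1, const double *p2, std::size_t start, std::size_t end)
{
    const __m128d ones = _mm_set1_pd(1.0);
    for (std::size_t i = start; i + 1 < end; i += 2) {
        __m128d a = _mm_loadu_pd(&p1[i]);        // load p1[i], p1[i+1]
        __m128d b = _mm_loadu_pd(&p2[i]);        // load p2[i], p2[i+1]
        a = _mm_add_pd(_mm_sub_pd(a, b), ones);  // (a - b) + 1, element-wise
        _mm_storeu_pd(&p1[i], a);                // store back to p1
    }
}

The point of the automatic approach discussed below is that the compiler can produce essentially this kind of code for you.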
Loop vectorization transforms procedural loops by assigning a processing unit to each pair of operands. Programs spend most of their time within such loops. Therefore, vectorization can significantly accelerate them, especially over large data sets.
The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips.
Vectorization is a programming technique that uses vector operations instead of element-by-element loop-based operations. Besides frequently producing more succinct code, vectorization also allows for better optimization in the subsequent implementation.
The -O3 flag turns on -ftree-vectorize automatically (see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html):
-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options
So in both cases the compiler is trying to do loop vectorization.
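As a side note, if you want to confirm which vectorization options your particular GCC enables at a given -O level, you can query the compiler directly (the exact option names listed vary by GCC version):

g++ -Q --help=optimizers -O3 | grep vectorize

This prints each optimizer option together with an [enabled] or [disabled] marker.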
Using g++ 4.8.2 to compile with:
# In newer versions of GCC use -fopt-info-vec / -fopt-info-vec-missed instead of -ftree-vectorizer-verbose
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test
Gives this:
Analyzing loop at test.cpp:16
Vectorizing loop at test.cpp:16
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.
test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29
test.cpp:22: note: vectorized 1 loops in function.
test.cpp:18: note: Unroll loop 7 times
test.cpp:16: note: Unroll loop 7 times
test.cpp:28: note: Unroll loop 1 times
Compiling without the -ftree-vectorize flag:
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test
Returns only this:
test_old.cpp:16: note: Unroll loop 7 times
test_old.cpp:28: note: Unroll loop 1 times
Line 16 is the start of the loop function, so the compiler is definitely vectorizing it. Checking the assembler confirms this too.
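If you want to inspect the assembler yourself, one way (assuming an x86-64 target) is to emit the assembly and look for packed double-precision instructions such as subpd/addpd, which only show up when the loop has actually been vectorized:

g++ test.cpp -O2 -std=c++0x -ftree-vectorize -S -o test.s
grep -E 'subpd|addpd' test.s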
I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to accurately measure how long the function takes to run.
But here are a couple of other things you can try too:
Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable).
Here's my resulting code (I removed the template since you will want to use different alignment for different data types):
#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
    double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
    double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

    for (size_t i = start; i < end; ++i)
    {
        pA1[i] = pA1[i] - pA2[i];
        pA1[i] += 1;
    }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v, u;
    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    foo(&v[0], &u[0], 0, n);

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}
Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
GCC has compiler extensions that create new primitives that will use SIMD instructions. Take a look here for details.
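For reference, here is a minimal sketch of what those GCC vector extensions look like for this loop. The function name foo_vec and the choice of a 16-byte (two-double, SSE-register-sized) vector are my assumptions, not anything from the question:

#include <cstddef>

// GCC vector extension: a vector of 2 doubles (16 bytes, fits one SSE register).
typedef double v2df __attribute__((vector_size(16)));

// Same loop body as foo(), two elements at a time; assumes (end - start) is even.
void foo_vec(double *p1, const double *p2, std::size_t start, std::size_t end)
{
    const v2df ones = {1.0, 1.0};
    for (std::size_t i = start; i + 1 < end; i += 2) {
        v2df a = {p1[i], p1[i + 1]};
        v2df b = {p2[i], p2[i + 1]};
        a = a - b + ones;        // element-wise arithmetic on the vector type
        p1[i]     = a[0];        // subscripting vector types needs a reasonably recent GCC
        p1[i + 1] = a[1];
    }
}

The appeal of this style over raw intrinsics is that the arithmetic stays ordinary C++ operators while GCC still maps it onto SIMD instructions.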
Most compilers say they will auto-vectorize operations, but this depends on the compiler's pattern matching, which as you can imagine can be very hit and miss.