In our code base we have a lot of operations of the form j*ω*X, where j is the imaginary unit, ω is real and X is complex. Many loops thus look like this:
#include <complex>
#include <vector>

typedef std::vector<std::complex<double> > cvector;

void mult_jomega(cvector &vec, double omega){
    std::complex<double> jomega(0.0, omega);
    for (auto &x : vec){
        x *= jomega;
    }
}
However, we exploit the fact that the real part of jomega is zero and write the multiplication as:
void mult_jomega_smart(cvector &vec, double omega){
    for (auto &x : vec){
        x = {-omega*x.imag(), omega*x.real()};
    }
}
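Both versions compute the same value, since (a + b·i)·(0 + ω·i) = (a·0 − b·ω) + (a·ω + b·0)·i, which collapses to (−ω·b, ω·a) once the zero products are dropped.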
At the beginning, I was dismissive of this "smart" version: it is less readable, more error-prone, and, last but not least, the compiler should be smart enough to do this optimization itself. However, as some performance regressions have shown, the last argument doesn't hold water. When comparing the two functions (see the listings further below), the smart version consistently outperforms the original, with -O2 as well as with -O3:
size      orig (μs/call)   smart (μs/call)   speedup
10        0.039928         0.0117551         3.39665
100       0.328564         0.0861379         3.81439
500       1.62269          0.417475          3.8869
1000      3.33012          0.760515          4.37877
2000      6.46696          1.56048           4.14422
10000     32.2827          9.2361            3.49528
100000    326.828          115.158           2.8381
500000    1660.43          850.415           1.95249
The smart version is about 4 times faster on my machine (gcc-5.4); only as the task becomes more and more memory-bound with increasing array size does the speed-up drop to a factor of 2.
My question is: what prevents the compiler from optimizing the less smart but more readable version? After all, the compiler can see that the real part of jomega is zero. Is it possible to help the compiler by giving some additional compile flags?
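To make the obstacle concrete, here is a sketch of what the multiplication in mult_jomega has to compute once it is written out (the name mult_jomega_expanded is hypothetical, for illustration only; gcc may additionally emit inf/NaN fix-up code to get the full C99 Annex G complex semantics):

#include <complex>

// Hypothetical expansion of x *= std::complex<double>(0.0, omega):
void mult_jomega_expanded(std::complex<double> &x, double omega){
    double re = x.real(), im = x.imag();
    // re*0.0 may not be folded to 0.0: it is -0.0 for re < 0
    // and NaN for re = inf, so both zero products must be kept.
    x = {re * 0.0 - im * omega,
         re * omega + im * 0.0};
}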
NB: The speedup exists for other compilers as well:

compiler       speedup
g++-5.4        4
g++-7.2        4
clang++-3.8    2 [the original version is twice as fast as with gcc]
Listings:
mult.cpp - in a separate translation unit, to prevent inlining:
#include <complex>
#include <vector>

typedef std::vector<std::complex<double> > cvector;

void mult_jomega(cvector &vec, double omega){
    std::complex<double> jomega(0.0, omega);
    for (auto &x : vec){
        x *= jomega;
    }
}

void mult_jomega_smart(cvector &vec, double omega){
    for (auto &x : vec){
        x = {-omega*x.imag(), omega*x.real()};
    }
}
main.cpp:
#include <chrono>
#include <complex>
#include <vector>
#include <iostream>

typedef std::vector<std::complex<double> > cvector;

void mult_jomega(cvector &vec, double omega);
void mult_jomega_smart(cvector &vec, double omega);

const size_t N=100000;     //number of repetitions, 10^5
const double OMEGA=1.0;    //use 1.0: magnitudes stay constant -> no problems with inf & Co

void compare_results(const cvector &vec){
    cvector m=vec;
    cvector m_smart=vec;
    mult_jomega(m, 5.0);
    mult_jomega_smart(m_smart, 5.0);
    std::cout<<m[0]<<" vs "<<m_smart[0]<<"\n";
    std::cout<<(m==m_smart ? "equal!" : "not equal!")<<"\n";
}

void test(size_t vector_size){
    cvector vec(vector_size, std::complex<double>{1.0, 1.0});

    //compare results, trigger if in doubt:
    //compare_results(vec);

    //warm-up, just in case:
    for(size_t i=0;i<N;i++)
        mult_jomega(vec, OMEGA);

    //time mult_jomega:
    auto begin = std::chrono::high_resolution_clock::now();
    for(size_t i=0;i<N;i++)
        mult_jomega(vec, OMEGA);
    auto end = std::chrono::high_resolution_clock::now();
    auto time_jomega=std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e3;

    //time mult_jomega_smart:
    begin = std::chrono::high_resolution_clock::now();
    for(size_t i=0;i<N;i++)
        mult_jomega_smart(vec, OMEGA);
    end = std::chrono::high_resolution_clock::now();
    auto time_jomega_smart=std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e3;

    double speedup=time_jomega/time_jomega_smart;
    std::cout<<vector_size<<"\t"<<time_jomega/N<<"\t"<<time_jomega_smart/N<<"\t"<<speedup<<"\n";
}

int main(){
    std::cout<<"size\tmult_jomega(musec)\tmult_jomega_smart(musec)\tspeedup\n";
    for(const auto &size : std::vector<size_t>{10,100,500,1000,2000,10000,100000,500000})
        test(size);
}
Building & running:
g++ main.cpp mult.cpp -O3 -std=c++11 -o mult_test
./mult_test
Compiling with the flag -ffast-math brings the original version up to the speed of the smart one:
size     mult_jomega (μs/call)   mult_jomega_smart (μs/call)   speedup
10       0.00860809              0.00818644                    1.05151
100      0.0706683               0.0693907                     1.01841
500      0.29569                 0.297323                      0.994509
1000     0.582059                0.57622                       1.01013
2000     1.30809                 1.24758                       1.0485
10000    7.37559                 7.4854                        0.98533
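For reference, these numbers come from the same build line as above with the flag appended, e.g.:

g++ main.cpp mult.cpp -O3 -std=c++11 -ffast-math -o mult_test
./mult_test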
Edit: More specifically, it is the -funsafe-math-optimizations compiler flag. According to the documentation, this flag is used to

allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.
Edit 2: Even more specifically, it is the -fno-signed-zeros option, which:

Allows optimizations for floating-point arithmetic that ignore the signedness of zero. IEEE arithmetic specifies the behavior of distinct +0.0 and -0.0 values, which then prohibits simplification of expressions such as x+0.0 or 0.0*x (even with -ffinite-math-only). This option implies that the sign of a zero result isn't significant.
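The effect of signed zeros is easy to see in isolation; here is a minimal standalone demonstration (independent of the benchmark above) of why re*0.0 cannot simply be replaced by 0.0:

#include <cstdio>
#include <cmath>
#include <limits>

int main(){
    double neg = -1.0;
    double inf = std::numeric_limits<double>::infinity();
    // prints "-0 (signbit: 1)": the zero product carries the sign of neg
    std::printf("-1.0 * 0.0 = %g (signbit: %d)\n", neg * 0.0, std::signbit(neg * 0.0));
    // prints "nan": for infinite operands the product is not 0.0 at all
    std::printf("inf * 0.0 = %g\n", inf * 0.0);
}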