I have noticed that multiplying two std::complex values is much, much slower using the overloaded * operator than writing out the operation. I've seen a 50x difference. That is totally ridiculous. I understand that the operator needs to check for NaN in the input, because of how complex infinity is defined. Can that really account for a 50x time difference?
I'm using GCC 5.4.0 with the flags -O3 -mavx -mavx2 -msse2 -mfma -mbmi.
Here's the test code:
    #include <iostream>
    #include <complex>
    #include <chrono>
    #include <cstdlib>   // std::rand, RAND_MAX
    #include <vector>

    int main( void ) {
        size_t N = 10000;
        // Fill the buffer with pseudo-random values in [-0.5, 0.5]
        std::vector< std::complex< double >> inbuf( N );
        for( size_t k = 0; k < N; ++k ) {
            inbuf[ k ] = std::complex< double >( std::rand(), std::rand() ) / ( double )RAND_MAX - 0.5;
        }

        // Version 1: product of neighbouring elements written out by hand
        std::complex< double > c2 = { 0, 0 };
        auto t0 = std::chrono::steady_clock::now();
        for( size_t i = 0; i < 10000; ++i ) {
            for( size_t j = 0; j < N - 1; ++j ) {
                double re = inbuf[ j ].real() * inbuf[ j + 1 ].real() - inbuf[ j ].imag() * inbuf[ j + 1 ].imag();
                double im = inbuf[ j ].real() * inbuf[ j + 1 ].imag() + inbuf[ j ].imag() * inbuf[ j + 1 ].real();
                c2.real( c2.real() + re );
                c2.imag( c2.imag() + im );
            }
        }
        auto t1 = std::chrono::steady_clock::now();
        double time = ( std::chrono::duration< float >( t1 - t0 ) ).count();
        std::cout << c2 << " using manual *: " << time << std::endl;

        // Version 2: same computation using the overloaded operator*
        c2 = { 0, 0 };
        t0 = std::chrono::steady_clock::now();
        for( size_t i = 0; i < 10000; ++i ) {
            for( size_t j = 0; j < N - 1; ++j ) {
                c2 += inbuf[ j ] * inbuf[ j + 1 ];
            }
        }
        t1 = std::chrono::steady_clock::now();
        time = ( std::chrono::duration< float >( t1 - t0 ) ).count();
        std::cout << c2 << " using stdlib *: " << time << std::endl;
        return 0;
    }
Here's the output:
(-2.45689e+07,-134386) using manual *: 0.109344
(-2.45689e+07,-134386) using stdlib *: 5.4286
Edit: Given the different results by folks in the comments, I have done a bit more testing with various compile options. It turns out that the -mfma and the -mavx switches cause the "stdlib" version to be so slow. The -mfma switch gives the "manual" version a ~25% performance boost, but slows down the "stdlib" version about 13x:
cris@carrier:~/tmp/tests> g++ complex_test.cpp -o complex_test -O3 -std=c++11
cris@carrier:~/tmp/tests> ./complex_test
(-2.45689e+07,-134386) using manual *:0.138276
(-2.45689e+07,-134386) using stdlib *:0.412056
cris@carrier:~/tmp/tests> g++ complex_test.cpp -o complex_test -O3 -mfma -std=c++11
cris@carrier:~/tmp/tests> ./complex_test
(-2.45689e+07,-134386) using manual *:0.106551
(-2.45689e+07,-134386) using stdlib *:5.37662
I also tried clang-800 (Mac OS) and didn't see this extreme slow-down. g++-5 on Mac does the same as g++-5 on Linux. Maybe I've found a compiler bug?
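For reference, the hand-written product from the test can be wrapped in a small helper so the calling code still reads like complex arithmetic. This is only a minimal sketch: `mul_limited` is a made-up name, not anything from the standard library, and it deliberately skips any NaN/infinity handling, so it is only appropriate when the inputs are known to be finite.

```cpp
#include <complex>

// Hypothetical helper (name invented for this example): the textbook
// complex product (a+bi)(c+di) = (ac - bd) + (ad + bc)i, with no special
// handling of NaN or infinite operands.
inline std::complex<double> mul_limited(const std::complex<double>& a,
                                        const std::complex<double>& b) {
    return { a.real() * b.real() - a.imag() * b.imag(),
             a.real() * b.imag() + a.imag() * b.real() };
}
```

With this, the inner loop of the second version could be written as `c2 += mul_limited( inbuf[ j ], inbuf[ j + 1 ] );`. For example, `mul_limited({1, 2}, {3, 4})` gives `(-5,10)`.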
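This post worried me greatly! Compile with -O3 -ffast-math (or simply -Ofast) and the difference goes away. Note that -ffast-math turns on -fcx-limited-range, which tells GCC not to check whether the result of a complex multiplication is NaN + i*NaN, so only use it if you don't depend on that check.
See also: gcc differences between -O3 vs -Ofast optimizations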
Straight (no optimization flags):
g++ -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 5.20456
(-2.50606e+07,-29494.2) using stdlib *: 4.02066
Ofast:
g++ -Ofast -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 0.154484
(-2.50606e+07,-29494.2) using stdlib *: 0.155045
O3:
g++ -O3 -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 0.193446
(-2.50606e+07,-29494.2) using stdlib *: 0.350336
O3 + ffast-math:
g++ -O3 -ffast-math -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 0.154603
(-2.50606e+07,-29494.2) using stdlib *: 0.156592
ffast-math alone (no -O level):
g++ -ffast-math -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 5.17364
(-2.50606e+07,-29494.2) using stdlib *: 4.0194
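As far as I understand it, the flags matter because without -ffast-math GCC follows the C99 Annex G rules for complex multiplication: compute the plain product, and if both parts of the result come out NaN, try to recover an infinite result. Depending on the GCC version this is either a call into a runtime routine (`__muldc3` for double) for every product, or an inline fast path that only falls back to the routine when the check trips. The sketch below is a simplified illustration modelled on the example code in Annex G, not the actual libgcc source; `mul_checked` is a made-up name.

```cpp
#include <cmath>
#include <complex>
#include <limits>

// Hypothetical illustration of a "checked" complex product in the spirit of
// C99 Annex G. Simplified for reading; not the real runtime implementation.
inline std::complex<double> mul_checked(std::complex<double> z,
                                        std::complex<double> w) {
    double a = z.real(), b = z.imag(), c = w.real(), d = w.imag();
    double ac = a * c, bd = b * d, ad = a * d, bc = b * c;
    double x = ac - bd, y = ad + bc;            // fast path: textbook formula

    if (std::isnan(x) && std::isnan(y)) {       // slow path: try to recover an infinity
        bool recalc = false;
        // Replace a NaN by a signed zero so it cannot poison the recomputation.
        auto zero_if_nan = [](double v) {
            return std::isnan(v) ? std::copysign(0.0, v) : v;
        };
        // Map an infinite part to +-1 and any other part to +-0.
        auto box = [](double v) {
            return std::copysign(std::isinf(v) ? 1.0 : 0.0, v);
        };
        if (std::isinf(a) || std::isinf(b)) {   // z is infinite
            a = box(a); b = box(b);
            c = zero_if_nan(c); d = zero_if_nan(d);
            recalc = true;
        }
        if (std::isinf(c) || std::isinf(d)) {   // w is infinite
            c = box(c); d = box(d);
            a = zero_if_nan(a); b = zero_if_nan(b);
            recalc = true;
        }
        if (!recalc && (std::isinf(ac) || std::isinf(bd) ||
                        std::isinf(ad) || std::isinf(bc))) {
            // An intermediate product overflowed although no operand is infinite.
            a = zero_if_nan(a); b = zero_if_nan(b);
            c = zero_if_nan(c); d = zero_if_nan(d);
            recalc = true;
        }
        if (recalc) {
            const double inf = std::numeric_limits<double>::infinity();
            x = inf * (a * c - b * d);
            y = inf * (a * d + b * c);
        }
    }
    return { x, y };
}
```

For finite inputs like the ones in the benchmark the slow path is never taken, so the result is identical to the hand-written product. The extra cost comes from the check itself and, where a library call is involved, from losing inlining and vectorization in the hot loop. With -ffast-math (or just -fcx-limited-range) the compiler is allowed to emit only the first three lines of arithmetic.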