
std::complex multiplication is extremely slow

Tags: std, gcc
I have noticed that multiplying two std::complex values is much, much slower using the overloaded * operator than writing out the operation. I've seen a 50x difference. That is totally ridiculous. I understand that the operator needs to check for NaN in the input, because of how complex infinity is defined. Can that really account for a 50x time difference?

I'm using GCC 5.4.0 with the flags -O3 -mavx -mavx2 -msse2 -mfma -mbmi.

Here's the test code:

#include <iostream>
#include <complex>
#include <chrono>
#include <cstdlib>   // std::rand, RAND_MAX
#include <vector>

int main( void ) {
  size_t N = 10000;
  std::vector< std::complex< double >> inbuf( N );
  for( size_t k = 0; k < N; ++k ) {
     inbuf[ k ] = std::complex< double >( std::rand(), std::rand() ) / ( double )RAND_MAX - 0.5;
  }

  std::complex< double > c2 = { 0, 0 };
  auto t0 = std::chrono::steady_clock::now();
  for( size_t i = 0; i < 10000; ++i ) {
     for( size_t j = 0; j < N - 1; ++j ) {
        double re = inbuf[ j ].real() * inbuf[ j + 1 ].real() - inbuf[ j ].imag() * inbuf[ j + 1 ].imag();
        double im = inbuf[ j ].real() * inbuf[ j + 1 ].imag() + inbuf[ j ].imag() * inbuf[ j + 1 ].real();
        c2.real( c2.real() + re );
        c2.imag( c2.imag() + im );
     }
  }
  auto t1 = std::chrono::steady_clock::now();
  double time = ( std::chrono::duration< float >( t1 - t0 ) ).count();
  std::cout << c2 << " using manual *: " << time << std::endl;

  c2 = { 0, 0 };
  t0 = std::chrono::steady_clock::now();
  for( size_t i = 0; i < 10000; ++i ) {
     for( size_t j = 0; j < N - 1; ++j ) {
        c2 += inbuf[ j ] * inbuf[ j + 1 ];
     }
  }
  t1 = std::chrono::steady_clock::now();
  time = ( std::chrono::duration< float >( t1 - t0 ) ).count();
  std::cout << c2 << " using stdlib *: " << time << std::endl;
  return 0;
}

Here's the output:

(-2.45689e+07,-134386) using manual *: 0.109344
(-2.45689e+07,-134386) using stdlib *: 5.4286

Edit: Given the different results by folks in the comments, I have done a bit more testing with various compile options. It turns out that the -mfma and the -mavx switches cause the "stdlib" version to be so slow. The -mfma switch gives the "manual" version a ~25% performance boost, but slows down the "stdlib" version about 13x:

cris@carrier:~/tmp/tests> g++ complex_test.cpp -o complex_test -O3 -std=c++11
cris@carrier:~/tmp/tests> ./complex_test                                     
(-2.45689e+07,-134386) using manual *:0.138276
(-2.45689e+07,-134386) using stdlib *:0.412056
cris@carrier:~/tmp/tests> g++ complex_test.cpp -o complex_test -O3 -mfma -std=c++11 
cris@carrier:~/tmp/tests> ./complex_test                                                  
(-2.45689e+07,-134386) using manual *:0.106551
(-2.45689e+07,-134386) using stdlib *:5.37662

I also tried clang-800 (Mac OS) and didn't see this extreme slow-down. g++-5 on Mac does the same as g++-5 on Linux. Maybe I've found a compiler bug?

Cris Luengo asked Mar 01 '26 23:03


1 Answer

This post worried me greatly! Use -O3 and -ffast-math (or -Ofast), and the differences go away.

See also the related question "gcc differences between -O3 vs -Ofast optimizations".
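As background on why the flag matters: GCC documents that -ffast-math implies -fcx-limited-range, which lets complex multiplication use the plain textbook formula without any NaN/infinity recovery. The sketch below contrasts the two behaviours. The names mul_limited and mul_checked are made up for illustration, and the checked version follows the example algorithm in C99 Annex G; treating that as a stand-in for what the compiler's runtime helper does is an assumption, not something verified against GCC 5.4.

```cpp
#include <cmath>
#include <complex>
#include <limits>

using cd = std::complex<double>;

// "Limited range" product: the textbook formula, same as the manual loop
// in the question. No special handling of NaN or infinity.
cd mul_limited(cd z, cd w) {
    double a = z.real(), b = z.imag(), c = w.real(), d = w.imag();
    return cd(a * c - b * d, a * d + b * c);
}

// Map an infinity to +-1 and anything else to +-0, keeping the sign.
static double box(double x) {
    return std::copysign(std::isinf(x) ? 1.0 : 0.0, x);
}

// Replace a NaN by a zero with the same sign bit; leave other values alone.
static double nan_to_zero(double x) {
    return std::isnan(x) ? std::copysign(0.0, x) : x;
}

// Checked product, modelled on the C99 Annex G example (illustrative only).
// The fast path is identical to mul_limited; the extra work only runs when
// both parts of the result come out as NaN.
cd mul_checked(cd z, cd w) {
    double a = z.real(), b = z.imag(), c = w.real(), d = w.imag();
    double ac = a * c, bd = b * d, ad = a * d, bc = b * c;
    double x = ac - bd, y = ad + bc;
    if (std::isnan(x) && std::isnan(y)) {
        bool recalc = false;
        if (std::isinf(a) || std::isinf(b)) {          // z is infinite
            a = box(a); b = box(b);
            c = nan_to_zero(c); d = nan_to_zero(d);
            recalc = true;
        }
        if (std::isinf(c) || std::isinf(d)) {          // w is infinite
            c = box(c); d = box(d);
            a = nan_to_zero(a); b = nan_to_zero(b);
            recalc = true;
        }
        if (!recalc && (std::isinf(ac) || std::isinf(bd) ||
                        std::isinf(ad) || std::isinf(bc))) {
            // Infinity produced by overflow of an intermediate product.
            a = nan_to_zero(a); b = nan_to_zero(b);
            c = nan_to_zero(c); d = nan_to_zero(d);
            recalc = true;
        }
        if (recalc) {
            const double inf = std::numeric_limits<double>::infinity();
            x = inf * (a * c - b * d);
            y = inf * (a * d + b * c);
        }
    }
    return cd(x, y);
}
```

For ordinary finite inputs both functions return the same value. They differ for inputs such as (inf, inf) * (1, 0), where the textbook formula yields NaN in both parts and the checked version recovers an infinite result. Note that this sketch must itself be built without -ffast-math, otherwise the compiler is allowed to assume the NaN tests are never true.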

No optimization flags:

g++ -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 5.20456
(-2.50606e+07,-29494.2) using stdlib *: 4.02066

Ofast:

g++ -Ofast -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 0.154484
(-2.50606e+07,-29494.2) using stdlib *: 0.155045

O3:

g++ -O3 -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 0.193446
(-2.50606e+07,-29494.2) using stdlib *: 0.350336

O3 + ffast-math:

g++ -O3 -ffast-math -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 0.154603
(-2.50606e+07,-29494.2) using stdlib *: 0.156592

ffast-math:

g++ -ffast-math -std=c++11 timingComplex.cpp && ./a.out
(-2.50606e+07,-29494.2) using manual *: 5.17364
(-2.50606e+07,-29494.2) using stdlib *: 4.0194
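Since the timings above depend so strongly on the flags, it can help to have the benchmark report how it was built. A minimal sketch, assuming GCC or Clang: both predefine __OPTIMIZE__, __FAST_MATH__, __FMA__ and __AVX__ when the corresponding options are active. The function name build_flags_summary is hypothetical.

```cpp
#include <string>

// Summarise a few flag-dependent predefined macros (GCC/Clang).
// Each entry is "on" only if the matching option was given at compile time.
std::string build_flags_summary() {
    std::string s;
#ifdef __OPTIMIZE__
    s += "optimize=on ";      // any -O level above -O0
#else
    s += "optimize=off ";
#endif
#ifdef __FAST_MATH__
    s += "fast-math=on ";     // -ffast-math or -Ofast
#else
    s += "fast-math=off ";
#endif
#ifdef __FMA__
    s += "fma=on ";           // -mfma (or an -march that includes it)
#else
    s += "fma=off ";
#endif
#ifdef __AVX__
    s += "avx=on";            // -mavx (or an -march that includes it)
#else
    s += "avx=off";
#endif
    return s;
}
```

Printing the returned string next to each timing line makes it easier to compare results posted from different machines and builds.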

ajb204 answered Mar 04 '26 17:03