I have set up a test program to compare array access performance to that of std::vector. I have found several similar questions but none seem to address my specific concern. I was scratching my head for some time over why array access seemed to be 6 times faster than vector access, when I have read in the past that they should be equivalent. As it turns out, this seems to be a function of the Intel compiler (v12) and optimization (occurs with anything above -O1), since I see better performance with std::vector when using gcc v4.1.2, and array has only a 2x advantage with gcc v4.4.4. I am running the tests on a RHEL 5.8 machine with Xeon X5355 cores. As an aside, I have found iterators to be faster than element access. I am compiling with the following commands: <pre class="prettyprint"><code>icpc -fast test.cc g++44 -O3 test.cc </code></pre> Can anyone explain the dramatic improvement in speed? <pre class="prettyprint"><code>#include <vector> #include <iostream> using namespace std; int main() { int sz = 100; clock_t start,stop; int ncycle=1000; float temp = 1.1; // Set up and initialize vector vector< vector< vector<float> > > A(sz, vector< vector<float> >(sz, vector<float>(sz, 1.0))); // Set up and initialize array float*** a = new float**[sz]; for( int i=0; i<sz; ++i) { a[i] = new float*[sz]; for( int j=0; j<sz; ++j) { a[i][j] = new float[sz](); for( int k=0; k<sz; ++k) a[i][j][k] = 1.0; } } // Time the array start = clock(); for( int n=0; n<ncycle; ++n ) for( int i=0; i<sz; ++i ) for( int j=0; j<sz; ++j ) for( int k=0; k<sz; ++k ) a[i][j][k] *= temp; stop = clock(); std::cout << "STD ARRAY: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" << std::endl; // Time the vector start = clock(); /* */ for( int n=0; n < ncycle; ++n ) for (vector<vector<vector<float> > >::iterator it1 = A.begin(); it1 != A.end(); ++it1) for (vector<vector<float> >::iterator it2 = it1->begin(); it2 != it1->end(); ++it2) for (vector<float>::iterator it3 =it2->begin(); it3 != it2->end(); ++it3) *it3 *= temp; /* for( int n=0; n < ncycle; ++n ) for( int i=0; i < sz; ++i ) for( int j=0; j < sz; ++j ) for( int k=0; k < sz; ++k ) A[i][j][k] *= temp; */ stop = clock(); std::cout << "VECTOR: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" << std::endl; for( int i=0; i<100; ++i) { for( int j=0; j<100; ++j) delete[] a[i][j]; } for( int i=0; i<100; ++i) { delete[] a[i]; } delete[] a; return 0; } </code></pre> SOLVED After noting Bo's indication that the compiler "knows everything" about the loop and can therefore optimize it more than the vector case, I replaced the multiplications by "temp" with multiplications by a call to "rand()". This leveled the playing field and in fact seems to give std::vector a slight lead. Timing of various scenarios are as follows: <pre class="prettyprint"><code>ARRAY (flat): 111.15 seconds ARRAY (flat): 0.011115 seconds per cycle ARRAY (3d): 111.73 seconds ARRAY (3d): 0.011173 seconds per cycle VECTOR (flat): 110.51 seconds VECTOR (flat): 0.011051 seconds per cycle VECTOR (3d): 118.05 seconds VECTOR (3d): 0.011805 seconds per cycle VECTOR (flat iterator): 108.55 seconds VECTOR (flat iterator): 0.010855 seconds per cycle VECTOR (3d iterator): 111.93 seconds VECTOR (3d iterator): 0.011193 seconds per cycle </code></pre> The takeaway seems to be that vectors are just as fast as arrays, and slightly faster when flattened (contiguous memory) and used with iterators. My experiment only averaged over 10,000 iterations, so it could be argued that these are all roughly equivalent and the choice of which to use should be determined by whichever is easiest to use; in my case, that would be the "3d iterator" case.

There is no black magic here, it is just too easy for the compiler to see that here <pre class="prettyprint"><code>for( int n=0; n<ncycle; ++n ) for( int i=0; i<sz; ++i ) for( int j=0; j<sz; ++j ) for( int k=0; k<sz; ++k ) a[i][j][k] *= temp; </code></pre> everything is known at compile time. It can easily unroll the loop to speed it up.

Intrinsic array access is much faster than std::vector access -- Black Magic?

Tags:

c++

iterator

arrays

vector

intel

I have set up a test program to compare array access performance to that of std::vector. I have found several similar questions but none seem to address my specific concern. I was scratching my head for some time over why array access seemed to be 6 times faster than vector access, when I have read in the past that they should be equivalent. As it turns out, this seems to be a function of the Intel compiler (v12) and optimization (occurs with anything above -O1), since I see better performance with std::vector when using gcc v4.1.2, and array has only a 2x advantage with gcc v4.4.4. I am running the tests on a RHEL 5.8 machine with Xeon X5355 cores. As an aside, I have found iterators to be faster than element access.

I am compiling with the following commands:

icpc -fast test.cc
g++44 -O3 test.cc

Can anyone explain the dramatic improvement in speed?

#include <vector>
#include <iostream>

using namespace std;

int main() {
  int sz = 100;
  clock_t start,stop;
  int ncycle=1000;
  float temp  = 1.1;

  // Set up and initialize vector
  vector< vector< vector<float> > > A(sz, vector< vector<float> >(sz,  vector<float>(sz, 1.0)));

  // Set up and initialize array
  float*** a = new float**[sz];
  for( int i=0; i<sz; ++i) {
    a[i] = new float*[sz];
    for( int j=0; j<sz; ++j) {
      a[i][j] = new float[sz]();
      for( int k=0; k<sz; ++k)
        a[i][j][k] = 1.0;
    }
  }

  // Time the array
  start = clock();
  for( int n=0; n<ncycle; ++n )
    for( int i=0; i<sz; ++i )
      for( int j=0; j<sz; ++j )
        for( int k=0; k<sz; ++k )
          a[i][j][k] *= temp;

  stop = clock();
  std::cout << "STD ARRAY: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds"     << std::endl;

  // Time the vector
      start = clock();
  /*
  */
  for( int n=0; n < ncycle; ++n )
    for (vector<vector<vector<float> > >::iterator it1 = A.begin(); it1 != A.end();     ++it1)
      for (vector<vector<float> >::iterator it2 = it1->begin(); it2 != it1->end();     ++it2)
        for (vector<float>::iterator it3 =it2->begin(); it3 != it2->end(); ++it3)
          *it3 *= temp;
  /*
     for( int n=0; n < ncycle; ++n )
       for( int i=0; i < sz; ++i )
         for( int j=0; j < sz; ++j )
           for( int k=0; k < sz; ++k )
             A[i][j][k] *= temp;
  */

  stop = clock();
  std::cout << "VECTOR: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" <<     std::endl;


  for( int i=0; i<100; ++i) {
    for( int j=0; j<100; ++j)
      delete[] a[i][j];
  }
  for( int i=0; i<100; ++i) {
    delete[] a[i];
  }
  delete[] a;
  return 0;
}

SOLVED

After noting Bo's indication that the compiler "knows everything" about the loop and can therefore optimize it more than the vector case, I replaced the multiplications by "temp" with multiplications by a call to "rand()". This leveled the playing field and in fact seems to give std::vector a slight lead. Timing of various scenarios are as follows:

ARRAY (flat): 111.15 seconds
ARRAY (flat): 0.011115 seconds per cycle
ARRAY (3d): 111.73 seconds
ARRAY (3d): 0.011173 seconds per cycle
VECTOR (flat): 110.51 seconds
VECTOR (flat): 0.011051 seconds per cycle
VECTOR (3d): 118.05 seconds
VECTOR (3d): 0.011805 seconds per cycle
VECTOR (flat iterator): 108.55 seconds
VECTOR (flat iterator): 0.010855 seconds per cycle
VECTOR (3d iterator): 111.93 seconds
VECTOR (3d iterator): 0.011193 seconds per cycle

The takeaway seems to be that vectors are just as fast as arrays, and slightly faster when flattened (contiguous memory) and used with iterators. My experiment only averaged over 10,000 iterations, so it could be argued that these are all roughly equivalent and the choice of which to use should be determined by whichever is easiest to use; in my case, that would be the "3d iterator" case.

783

asked Aug 23 '12 13:08

Ethereal

1 Answers

There is no black magic here, it is just too easy for the compiler to see that here

for( int n=0; n<ncycle; ++n )
   for( int i=0; i<sz; ++i )
     for( int j=0; j<sz; ++j )
       for( int k=0; k<sz; ++k )
          a[i][j][k] *= temp;

everything is known at compile time. It can easily unroll the loop to speed it up.

106

answered Sep 18 '22 10:09

Bo Persson

Related questions
                            
                                How do I use the Direct3D Device Manager?
                            
                                Simple Curve Fitting Implimentation in C++ (SVD Least Sqares Fit or similar)
                            
                                Decode h264 rtsp with ffmpeg and separated AVCodecContext
                            
                                Android: Getting random number from JNI method
                            
                                pthread_create differences in linux kernel 2.4.20 and 2.4.36
                            
                                How to make a looping/circular ScrollViewer in Windows 8 Metro (C++/XAML)
                            
                                Using decltype with virtual member function pointers
                            
                                How do I get the template type of a given element at runtime in C++?
                            
                                boost::asio async condition
                            
                                Cannot set Eclipse Juno C++ with std=c++11
                            
                                Optimization to find complex number as input
                            
                                Algorithm analysis: Am I analyzing these algorithms correctly? How to approach problems like these [closed]
                            
                                My linked-list node removing function causes other parts of my program to crash
                            
                                Color map in boost graph breadth_first_visit
                            
                                check if point exists in QPainterPath
                            
                                Return type of bitwise operators in C++
                            
                                calling functions above their declaration
                            
                                Can't compile in Linux with curl - undefined reference
                            
                                Using vanilla GCC (or Clang) with android NDK
                            
                                Why does C++ Boost distribution have `.dll`and `.lib` files?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With