I have set up a test program to compare array access performance to that of std::vector. I have found several similar questions but none seem to address my specific concern. I was scratching my head for some time over why array access seemed to be 6 times faster than vector access, when I have read in the past that they should be equivalent. As it turns out, this seems to be a function of the Intel compiler (v12) and optimization (occurs with anything above -O1), since I see better performance with std::vector when using gcc v4.1.2, and array has only a 2x advantage with gcc v4.4.4. I am running the tests on a RHEL 5.8 machine with Xeon X5355 cores. As an aside, I have found iterators to be faster than element access.
I am compiling with the following commands:
icpc -fast test.cc
g++44 -O3 test.cc
Can anyone explain the dramatic improvement in speed?
#include <vector>
#include <iostream>
using namespace std;
int main() {
int sz = 100;
clock_t start,stop;
int ncycle=1000;
float temp = 1.1;
// Set up and initialize vector
vector< vector< vector<float> > > A(sz, vector< vector<float> >(sz, vector<float>(sz, 1.0)));
// Set up and initialize array
float*** a = new float**[sz];
for( int i=0; i<sz; ++i) {
a[i] = new float*[sz];
for( int j=0; j<sz; ++j) {
a[i][j] = new float[sz]();
for( int k=0; k<sz; ++k)
a[i][j][k] = 1.0;
}
}
// Time the array
start = clock();
for( int n=0; n<ncycle; ++n )
for( int i=0; i<sz; ++i )
for( int j=0; j<sz; ++j )
for( int k=0; k<sz; ++k )
a[i][j][k] *= temp;
stop = clock();
std::cout << "STD ARRAY: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" << std::endl;
// Time the vector
start = clock();
/*
*/
for( int n=0; n < ncycle; ++n )
for (vector<vector<vector<float> > >::iterator it1 = A.begin(); it1 != A.end(); ++it1)
for (vector<vector<float> >::iterator it2 = it1->begin(); it2 != it1->end(); ++it2)
for (vector<float>::iterator it3 =it2->begin(); it3 != it2->end(); ++it3)
*it3 *= temp;
/*
for( int n=0; n < ncycle; ++n )
for( int i=0; i < sz; ++i )
for( int j=0; j < sz; ++j )
for( int k=0; k < sz; ++k )
A[i][j][k] *= temp;
*/
stop = clock();
std::cout << "VECTOR: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" << std::endl;
for( int i=0; i<100; ++i) {
for( int j=0; j<100; ++j)
delete[] a[i][j];
}
for( int i=0; i<100; ++i) {
delete[] a[i];
}
delete[] a;
return 0;
}
SOLVED
After noting Bo's indication that the compiler "knows everything" about the loop and can therefore optimize it more than the vector case, I replaced the multiplications by "temp" with multiplications by a call to "rand()". This leveled the playing field and in fact seems to give std::vector a slight lead. Timing of various scenarios are as follows:
ARRAY (flat): 111.15 seconds
ARRAY (flat): 0.011115 seconds per cycle
ARRAY (3d): 111.73 seconds
ARRAY (3d): 0.011173 seconds per cycle
VECTOR (flat): 110.51 seconds
VECTOR (flat): 0.011051 seconds per cycle
VECTOR (3d): 118.05 seconds
VECTOR (3d): 0.011805 seconds per cycle
VECTOR (flat iterator): 108.55 seconds
VECTOR (flat iterator): 0.010855 seconds per cycle
VECTOR (3d iterator): 111.93 seconds
VECTOR (3d iterator): 0.011193 seconds per cycle
The takeaway seems to be that vectors are just as fast as arrays, and slightly faster when flattened (contiguous memory) and used with iterators. My experiment only averaged over 10,000 iterations, so it could be argued that these are all roughly equivalent and the choice of which to use should be determined by whichever is easiest to use; in my case, that would be the "3d iterator" case.
A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in run-time speed is slim and absent in any non-trivial program. One reason for this myth to persist, are examples that compare raw arrays with mis-used std::vectors.
In both cases, the array version goes almost twice as fast as the vector version.
Array is memory efficient data structure. Vector takes more time in accessing elements. Array access elements in constant time irrespective of their location as elements are arranged in a contiguous memory allocation.
The time difference is 0.333 seconds. Or a difference of 0.000000000333 per array access. Assuming a 2.33 GHz Processor like mine that's 0.7 instruction pipeline stages per array accesses. So the vector looks like it is using one extra instruction per accesses.
There is no black magic here, it is just too easy for the compiler to see that here
for( int n=0; n<ncycle; ++n )
for( int i=0; i<sz; ++i )
for( int j=0; j<sz; ++j )
for( int k=0; k<sz; ++k )
a[i][j][k] *= temp;
everything is known at compile time. It can easily unroll the loop to speed it up.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With