Today I decided to benchmark and compare some differences in gcc optimizability of std::vector and std::array. Generally, I found what I expected: performing a task on each of a collection of short arrays is much faster than performing the task on a collection of equivalent vectors.
However, I found something unexpected: using std::vector to store the collection of arrays is faster than using std::array. Just in case that was an artifact of keeping a large amount of data on the stack, I also tried allocating the collection as an array on the heap and as a C-style array on the heap (but the results still resemble the array of arrays on the stack and the vector of arrays).
Any idea why std::vector would ever outperform std::array (on which the compiler has more compile-time information)?
I compiled using gcc-4.7 -std=c++11 -O3 (gcc-4.6 -std=c++0x -O3 should also result in this conundrum). Runtimes were computed using the bash-native time command (user time).
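As an aside, the same pairwise loop could also be timed in-process instead of with bash's time. Here is a minimal sketch using std::chrono (assuming a C++11 <chrono> implementation; note this measures wall-clock time rather than the user time that time reports):

#include <chrono>
#include <iostream>

int main() {
  auto start = std::chrono::steady_clock::now();
  // ... run the pairwise-distance benchmark loop here ...
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "elapsed: " << elapsed.count() << " s" << std::endl;
  return 0;
}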
Code:
#include <array>
#include <vector>
#include <iostream>
#include <assert.h>
#include <algorithm>

template <typename VEC>
double fast_sq_dist(const VEC & lhs, const VEC & rhs) {
  assert(lhs.size() == rhs.size());
  double result = 0.0;
  for (int k=0; k<lhs.size(); ++k) {
    double tmp = lhs[k] - rhs[k];
    result += tmp * tmp;
  }
  return result;
}

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;

  // declare the data structure for the collection
  // (uncomment exactly one of these to time it)

  // array of arrays
  // runtime: 1.32s
  std::array<std::array<double, N>, K > mat;

  // array of arrays (allocated on the heap)
  // runtime: 1.33s
  // std::array<std::array<double, N>, K > & mat = *new std::array<std::array<double, N>, K >;

  // C-style heap array of arrays
  // runtime: 0.93s
  // std::array<double, N> * mat = new std::array<double, N>[K];

  // vector of arrays
  // runtime: 0.93s
  // std::vector<std::array<double, N> > mat(K);

  // vector of vectors
  // runtime: 2.16s
  // std::vector<std::vector<double> > mat(K, std::vector<double>(N));

  // fill the collection with some arbitrary values
  for (std::size_t k=0; k<K; ++k) {
    for (std::size_t j=0; j<N; ++j)
      mat[k][j] = k*N+j;
  }

  std::cerr << "constructed" << std::endl;

  // compute the sum of all pairwise distances in the collection
  double tot = 0.0;
  for (std::size_t j=0; j<K; ++j) {
    for (std::size_t k=0; k<K; ++k)
      tot += fast_sq_dist(mat[j], mat[k]);
  }

  std::cout << tot << std::endl;

  return 0;
}
NB 1: All versions print the same result.
NB 2: And just to demonstrate that the runtime differences between std::array<std::array<double, N>, K>, std::vector<std::array<double, N> >, and std::vector<std::vector<double> > weren't simply from assignment/initialization when allocating, the runtimes of simply allocating the collection (i.e. commenting out the computation and printing of tot) were 0.000s, 0.000s, and 0.004s, respectively.
NB 3: Each method is compiled and run separately (not timed back-to-back within the same executable), to prevent unfair differences in caching.
NB 4:
Assembly for array of arrays: http://ideone.com/SM8dB
Assembly for vector of arrays: http://ideone.com/vhpJv
Assembly for vector of vectors: http://ideone.com/RZTNE
NB 5: Just to be absolutely clear, I am in no way intending to criticize the STL. I absolutely love the STL, and not only do I use it frequently, but details of its effective use have taught me a lot of subtle and great features of C++. Instead, this is an intellectual pursuit: I was simply timing things to learn principles of efficient C++ design.
Furthermore, it would be unsound to blame the STL, because it is difficult to disentangle the cause of the runtime difference: with optimizations turned on, it can come from compiler optimizations that slow the code rather than speed it up; with optimizations turned off, it can come from unnecessary copy operations (which would be optimized out and never executed in production code) that can penalize certain data types more than others.
If you are curious like me, I'd love your help figuring this out.
A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in runtime speed is slim and absent in any non-trivial program. One reason this myth persists is examples that compare raw arrays with misused std::vectors.
The time difference is 0.333 seconds, or a difference of 0.000000000333 seconds per array access. Assuming a 2.33 GHz processor like mine, that's 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.
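For reference, here is a small sketch of that back-of-the-envelope arithmetic, taking the round figure of about 10^9 array accesses and the 2.33 GHz clock as given (both are assumptions of the estimate, not measurements):

#include <iostream>

int main() {
  const double time_diff_s = 0.333;  // measured runtime difference between the two variants
  const double accesses    = 1e9;    // assumed rough count of array accesses in the benchmark
  const double clock_hz    = 2.33e9; // assumed 2.33 GHz processor

  const double sec_per_access    = time_diff_s / accesses;    // ~3.3e-10 s per access
  const double cycles_per_access = sec_per_access * clock_hz; // ~0.78 cycles per access

  std::cout << sec_per_access << " s/access, "
            << cycles_per_access << " cycles/access" << std::endl;
  return 0;
}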
Arrays are slightly more compact: the size is implicit. Arrays are non-resizable; sometimes this is desirable. Arrays don't require parsing extra STL headers (compile time). It can be easier to interact with straight-C code with an array (e.g. if C is allocating and C++ is using).
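To make the C-interop point concrete, here is a small sketch; fill_buffer is a hypothetical stand-in for a C-style API that writes into a caller-provided buffer. Both a raw array and a std::array can be handed to it directly, the latter via .data():

#include <array>
#include <cstddef>

// hypothetical stand-in for a C-style API that fills a caller-provided buffer
void fill_buffer(double * buf, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    buf[i] = static_cast<double>(i);
}

int main() {
  double c_arr[4];
  std::array<double, 4> std_arr;

  fill_buffer(c_arr, 4);                       // raw array decays to double*
  fill_buffer(std_arr.data(), std_arr.size()); // std::array exposes its contiguous storage
  return 0;
}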
An array is a memory-efficient data structure. A vector can take slightly more time to access elements because of the extra indirection through its data pointer. An array accesses elements in constant time regardless of their position, since the elements are arranged in one contiguous memory allocation.
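The layout difference behind these claims can be made concrete: a std::array stores its elements inside the object itself, while a std::vector stores a pointer (plus size/capacity bookkeeping) and keeps the elements in a separate heap block. A minimal sketch (the sizeof values shown are typical 64-bit figures, not guarantees):

#include <array>
#include <vector>
#include <iostream>

int main() {
  std::array<double, 4> a = {0.0, 1.0, 2.0, 3.0}; // 4 doubles stored inline in 'a'
  std::vector<double> v(4, 0.0);                  // 'v' holds pointers; elements live on the heap

  std::cout << sizeof(a) << std::endl; // typically 32: just the elements
  std::cout << sizeof(v) << std::endl; // typically 24: begin/end/capacity pointers, no elements

  // both store their elements contiguously, so operator[] is a constant-time offset;
  // the vector just needs one extra load to fetch its data pointer first
  std::cout << a[2] + v[2] << std::endl;
  return 0;
}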
Consider the second and third tests. Conceptually, they are identical: allocate K * N * sizeof(double) bytes off the heap and then access them in exactly the same way. So why the different times?
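To spell out what those two tests allocate, here is a small sketch showing that both requests come to the same K * N * sizeof(double) bytes of contiguous doubles; only the allocation expression differs (new versus new[]):

#include <array>
#include <cstddef>
#include <iostream>

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;

  // second test: one object of type std::array<std::array<double, N>, K>, via new
  std::array<std::array<double, N>, K> * a = new std::array<std::array<double, N>, K>;

  // third test: K objects of type std::array<double, N>, via new[]
  std::array<double, N> * b = new std::array<double, N>[K];

  // both amount to K * N * sizeof(double) bytes of doubles
  std::cout << sizeof(*a) << " vs " << K * sizeof(*b)
            << " (" << K * N * sizeof(double) << " bytes)" << std::endl;

  delete a;
  delete [] b;
  return 0;
}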
All of your "faster" tests have one thing in common: new[]. All of the slower tests are allocated with new or on the stack. vector probably uses new[] Under the Hood™. The only obvious cause for this is that the implementations of new[] and new differ more significantly than expected.
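That guess about vector is checkable rather than something to take on faith: replacing the global allocation functions with logging versions shows which one each allocation actually goes through. A minimal sketch (the exact output depends on your standard library; with libstdc++, std::allocator typically routes through plain operator new):

#include <array>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

// logging replacements for the global allocation functions
void * operator new(std::size_t n) {
  std::printf("operator new(%zu)\n", n);
  if (void * p = std::malloc(n)) return p;
  throw std::bad_alloc();
}
void * operator new[](std::size_t n) {
  std::printf("operator new[](%zu)\n", n);
  if (void * p = std::malloc(n)) return p;
  throw std::bad_alloc();
}
void operator delete(void * p) noexcept { std::free(p); }
void operator delete[](void * p) noexcept { std::free(p); }

int main() {
  std::array<double, 4> * a = new std::array<double, 4>[1000]; // array new expression: operator new[]
  std::vector<std::array<double, 4> > v(1000);                 // std::allocator: see which it logs
  delete [] a;
  return 0;
}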
What I'm going to suggest is that new[] will fall back to mmap and allocate directly on a page boundary, giving you an alignment speedup, whereas the other two methods will not allocate on a page boundary.
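The alignment part of that hypothesis is also easy to probe directly: print each block's offset within a 4096-byte page and see where the allocator put it. A minimal sketch (what you actually see depends on the allocator, the allocation size, and the platform's page size):

#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;

  std::array<std::array<double, N>, K> * a = new std::array<std::array<double, N>, K>;
  std::array<double, N> * b = new std::array<double, N>[K];
  std::vector<std::array<double, N> > v(K);

  // offset of each block within a 4096-byte page; 0 would mean page-aligned
  std::cout << (reinterpret_cast<std::uintptr_t>(a)        % 4096) << std::endl;
  std::cout << (reinterpret_cast<std::uintptr_t>(b)        % 4096) << std::endl;
  std::cout << (reinterpret_cast<std::uintptr_t>(v.data()) % 4096) << std::endl;

  delete a;
  delete [] b;
  return 0;
}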
Consider using an OS allocation function to directly map committed pages, and then place a std::array<std::array<double, N>, K> into it.
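A rough sketch of that suggestion on Linux (POSIX-specific, with minimal error handling): use mmap to get page-aligned, zero-filled committed memory, then placement-new the std::array into it. Since std::array of double is trivially constructible and destructible, the placement new is essentially free and no explicit destructor call is needed:

#include <array>
#include <cstddef>
#include <iostream>
#include <new>
#include <sys/mman.h>

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;
  typedef std::array<std::array<double, N>, K> Mat;

  // map anonymous, zero-filled pages; the mapping starts on a page boundary by construction
  void * mem = mmap(0, sizeof(Mat), PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return 1;

  Mat & mat = *new (mem) Mat; // placement new: construct the array in the mapped pages

  mat[0][0] = 42.0;
  std::cout << mat[0][0] << std::endl;

  munmap(mem, sizeof(Mat));
  return 0;
}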