Today I decided to benchmark and compare some differences in gcc optimizability of std::vector and std::array. Generally, I found what I expected: performing a task on each of a collection of short arrays is much faster than performing the task on a collection of equivalent vectors.
However, I found something unexpected: using std::vector to store the collection of arrays is faster than using std::array. Just in case that was an artifact of keeping a large amount of data on the stack, I also tried allocating the collection as an array on the heap and as a C-style array on the heap (but the results still resemble the array of arrays on the stack and the vector of arrays).
Any idea why std::vector would ever outperform std::array (on which the compiler has more compile-time information)?
I compiled using gcc-4.7 -std=c++11 -O3 (gcc-4.6 -std=c++0x -O3 should also result in this conundrum). Runtimes were computed using the bash-native time command (user time).
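As an aside, the same pairwise loop could also be timed in-process instead of with bash's time. Here is a minimal sketch using std::chrono (assuming a C++11 <chrono> implementation; note this measures wall-clock time rather than the user time that time reports):

#include <chrono>
#include <iostream>

int main() {
  auto start = std::chrono::steady_clock::now();
  // ... run the pairwise-distance benchmark loop here ...
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "elapsed: " << elapsed.count() << " s" << std::endl;
  return 0;
}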
Code:
#include <array>
#include <vector>
#include <iostream>
#include <assert.h>
#include <algorithm>

template <typename VEC>
double fast_sq_dist(const VEC & lhs, const VEC & rhs) {
  assert(lhs.size() == rhs.size());
  double result = 0.0;
  for (int k=0; k<lhs.size(); ++k) {
    double tmp = lhs[k] - rhs[k];
    result += tmp * tmp;
  }
  return result;
}

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;

  // declare the data structure for the collection
  // (uncomment exactly one of these to time it)

  // array of arrays
  // runtime: 1.32s
  std::array<std::array<double, N>, K > mat;

  // array of arrays (allocated on the heap)
  // runtime: 1.33s
  // std::array<std::array<double, N>, K > & mat = *new std::array<std::array<double, N>, K >;

  // C-style heap array of arrays
  // runtime: 0.93s
  // std::array<double, N> * mat = new std::array<double, N>[K];

  // vector of arrays
  // runtime: 0.93s
  // std::vector<std::array<double, N> > mat(K);

  // vector of vectors
  // runtime: 2.16s
  // std::vector<std::vector<double> > mat(K, std::vector<double>(N));

  // fill the collection with some arbitrary values
  for (std::size_t k=0; k<K; ++k) {
    for (std::size_t j=0; j<N; ++j)
      mat[k][j] = k*N+j;
  }

  std::cerr << "constructed" << std::endl;

  // compute the sum of all pairwise distances in the collection
  double tot = 0.0;
  for (std::size_t j=0; j<K; ++j) {
    for (std::size_t k=0; k<K; ++k)
      tot += fast_sq_dist(mat[j], mat[k]);
  }

  std::cout << tot << std::endl;

  return 0;
}
NB 1: All versions print the same result.
NB 2: And just to demonstrate that the runtime differences between std::array<std::array<double, N>, K>, std::vector<std::array<double, N> >, and std::vector<std::vector<double> > weren't simply from assignment/initialization when allocating, the runtimes of simply allocating the collection (i.e. commenting out the computation and printing of tot) were 0.000s, 0.000s, and 0.004s, respectively.
NB 3: Each method is compiled and run separately (not timed back-to-back within the same executable), to prevent unfair differences in caching.
NB 4:
Assembly for array of arrays: http://ideone.com/SM8dB
Assembly for vector of arrays: http://ideone.com/vhpJv
Assembly for vector of vectors: http://ideone.com/RZTNE
NB 5: Just to be absolutely clear, I am in no way intending to criticize the STL. I absolutely love the STL, and not only do I use it frequently, but details of its effective use have taught me a lot of subtle and great features of C++. Instead, this is an intellectual pursuit: I was simply timing things to learn principles of efficient C++ design.
Furthermore, it would be unsound to blame the STL, because it is difficult to disentangle the cause of the runtime difference: with optimizations turned on, it can come from compiler optimizations that slow the code rather than speed it up; with optimizations turned off, it can come from unnecessary copy operations (which would be optimized out and never executed in production code) that can penalize certain data types more than others.
If you are curious like me, I'd love your help figuring this out.
A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in runtime speed is slim and absent in any non-trivial program. One reason this myth persists is examples that compare raw arrays with misused std::vectors.
The time difference is 0.333 seconds, or a difference of 0.000000000333 seconds per array access. Assuming a 2.33 GHz processor like mine, that's 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.
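For reference, here is a small sketch of that back-of-the-envelope arithmetic, taking the round figure of about 10^9 array accesses and the 2.33 GHz clock as given (both are assumptions of the estimate, not measurements):

#include <iostream>

int main() {
  const double time_diff_s = 0.333;  // measured runtime difference between the two variants
  const double accesses    = 1e9;    // assumed rough count of array accesses in the benchmark
  const double clock_hz    = 2.33e9; // assumed 2.33 GHz processor

  const double sec_per_access    = time_diff_s / accesses;    // ~3.3e-10 s per access
  const double cycles_per_access = sec_per_access * clock_hz; // ~0.78 cycles per access

  std::cout << sec_per_access << " s/access, "
            << cycles_per_access << " cycles/access" << std::endl;
  return 0;
}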
Arrays are slightly more compact: the size is implicit. Arrays are non-resizable; sometimes this is desirable. Arrays don't require parsing extra STL headers (compile time). It can be easier to interact with straight-C code with an array (e.g. if C is allocating and C++ is using).
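To make the C-interop point concrete, here is a small sketch; fill_buffer is a hypothetical stand-in for a C-style API that writes into a caller-provided buffer. Both a raw array and a std::array can be handed to it directly, the latter via .data():

#include <array>
#include <cstddef>

// hypothetical stand-in for a C-style API that fills a caller-provided buffer
void fill_buffer(double * buf, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    buf[i] = static_cast<double>(i);
}

int main() {
  double c_arr[4];
  std::array<double, 4> std_arr;

  fill_buffer(c_arr, 4);                       // raw array decays to double*
  fill_buffer(std_arr.data(), std_arr.size()); // std::array exposes its contiguous storage
  return 0;
}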
An array is a memory-efficient data structure. A vector can take slightly more time to access elements because of the extra indirection through its data pointer. An array accesses elements in constant time regardless of their position, since the elements are arranged in one contiguous memory allocation.
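The layout difference behind these claims can be made concrete: a std::array stores its elements inside the object itself, while a std::vector stores a pointer (plus size/capacity bookkeeping) and keeps the elements in a separate heap block. A minimal sketch (the sizeof values shown are typical 64-bit figures, not guarantees):

#include <array>
#include <vector>
#include <iostream>

int main() {
  std::array<double, 4> a = {0.0, 1.0, 2.0, 3.0}; // 4 doubles stored inline in 'a'
  std::vector<double> v(4, 0.0);                  // 'v' holds pointers; elements live on the heap

  std::cout << sizeof(a) << std::endl; // typically 32: just the elements
  std::cout << sizeof(v) << std::endl; // typically 24: begin/end/capacity pointers, no elements

  // both store their elements contiguously, so operator[] is a constant-time offset;
  // the vector just needs one extra load to fetch its data pointer first
  std::cout << a[2] + v[2] << std::endl;
  return 0;
}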
Consider the second and third tests. Conceptually, they are identical: allocate K * N * sizeof(double) bytes off the heap and then access them in exactly the same way. So why the different times?
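To spell out what those two tests allocate, here is a small sketch showing that both requests come to the same K * N * sizeof(double) bytes of contiguous doubles; only the allocation expression differs (new versus new[]):

#include <array>
#include <cstddef>
#include <iostream>

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;

  // second test: one object of type std::array<std::array<double, N>, K>, via new
  std::array<std::array<double, N>, K> * a = new std::array<std::array<double, N>, K>;

  // third test: K objects of type std::array<double, N>, via new[]
  std::array<double, N> * b = new std::array<double, N>[K];

  // both amount to K * N * sizeof(double) bytes of doubles
  std::cout << sizeof(*a) << " vs " << K * sizeof(*b)
            << " (" << K * N * sizeof(double) << " bytes)" << std::endl;

  delete a;
  delete [] b;
  return 0;
}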
All of your "faster" tests have one thing in common: new[]. All of the slower tests are allocated with new or on the stack. vector probably uses new[] Under the Hood™. The only obvious cause for this is that the implementations of new[] and new differ more significantly than expected.
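That guess about vector is checkable rather than something to take on faith: replacing the global allocation functions with logging versions shows which one each allocation actually goes through. A minimal sketch (the exact output depends on your standard library; with libstdc++, std::allocator typically routes through plain operator new):

#include <array>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

// logging replacements for the global allocation functions
void * operator new(std::size_t n) {
  std::printf("operator new(%zu)\n", n);
  if (void * p = std::malloc(n)) return p;
  throw std::bad_alloc();
}
void * operator new[](std::size_t n) {
  std::printf("operator new[](%zu)\n", n);
  if (void * p = std::malloc(n)) return p;
  throw std::bad_alloc();
}
void operator delete(void * p) noexcept { std::free(p); }
void operator delete[](void * p) noexcept { std::free(p); }

int main() {
  std::array<double, 4> * a = new std::array<double, 4>[1000]; // array new expression: operator new[]
  std::vector<std::array<double, 4> > v(1000);                 // std::allocator: see which it logs
  delete [] a;
  return 0;
}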
What I'm going to suggest is that new[] will fall back to mmap and allocate directly on a page boundary, giving you an alignment speedup, whereas the other two methods will not allocate on a page boundary.
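The alignment part of that hypothesis is also easy to probe directly: print each block's offset within a 4096-byte page and see where the allocator put it. A minimal sketch (what you actually see depends on the allocator, the allocation size, and the platform's page size):

#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;

  std::array<std::array<double, N>, K> * a = new std::array<std::array<double, N>, K>;
  std::array<double, N> * b = new std::array<double, N>[K];
  std::vector<std::array<double, N> > v(K);

  // offset of each block within a 4096-byte page; 0 would mean page-aligned
  std::cout << (reinterpret_cast<std::uintptr_t>(a)        % 4096) << std::endl;
  std::cout << (reinterpret_cast<std::uintptr_t>(b)        % 4096) << std::endl;
  std::cout << (reinterpret_cast<std::uintptr_t>(v.data()) % 4096) << std::endl;

  delete a;
  delete [] b;
  return 0;
}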
Consider using an OS allocation function to directly map committed pages, and then place a std::array<std::array<double, N>, K> into it.
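A rough sketch of that suggestion on Linux (POSIX-specific, with minimal error handling): use mmap to get page-aligned, zero-filled committed memory, then placement-new the std::array into it. Since std::array of double is trivially constructible and destructible, the placement new is essentially free and no explicit destructor call is needed:

#include <array>
#include <cstddef>
#include <iostream>
#include <new>
#include <sys/mman.h>

int main() {
  const std::size_t K = 20000;
  const std::size_t N = 4;
  typedef std::array<std::array<double, N>, K> Mat;

  // map anonymous, zero-filled pages; the mapping starts on a page boundary by construction
  void * mem = mmap(0, sizeof(Mat), PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return 1;

  Mat & mat = *new (mem) Mat; // placement new: construct the array in the mapped pages

  mat[0][0] = 42.0;
  std::cout << mat[0][0] << std::endl;

  munmap(mem, sizeof(Mat));
  return 0;
}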