I'm looking into OpenMP, partly because my program needs to add very large vectors (millions of elements). However, I see a fairly large difference depending on whether I use std::vector or a raw array, which I cannot explain. I insist that the difference is only in the loop, not in the initialization, of course.
The time difference I refer to comes only from timing the addition, specifically not taking into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time.
I use g++ 5.x on Ubuntu 16.04.
Edit: I tested what @Shadow said, and it got me thinking: is there something going on with optimization? If I compile with -O2, then, using initialized raw arrays, I get the for loop scaling with the number of threads again. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimizes before the pragma is seen.
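For reference, the two invocations I am comparing (same main.cpp as below, only the optimization level changes):
g++ -O2 -fopenmp -std=c++11 main.cpp
g++ -O3 -fopenmp -std=c++11 main.cpp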
I came up with the following simple test:
#include <iostream>
#include <vector>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    std::vector<double> a, b, c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++){
            #pragma omp for
            for( i = 0; i < SIZE; i++){
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}
I compile with
g++ -O3 -fopenmp -std=c++11 main.cpp
And I get for one thread:
>time ./a.out
finished in 2.5638
./a.out 2.58s user 0.04s system 99% cpu 2.619 total.
For two threads, the loop takes 1.2 s, 1.23 s total.
Now if I use raw arrays:
#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++)
        {
            #pragma omp for
            for( i = 0; i < SIZE; i++)
            {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
And I get (1 thread):
>time ./a.out
finished in 1.92901
./a.out 1.92s user 0.01s system 99% cpu 1.939 total
std::vector is 33% slower!
For two threads:
>time ./a.out
finished in 1.20061
./a.out 2.39s user 0.02s system 198% cpu 1.208 total
As a comparison, with Eigen or Armadillo for exactly the same operation (using the c = a + b overload with vector objects), I get a total real time of ~2.8 s. They are not multi-threaded for vector additions.
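For reference, a minimal sketch of the Eigen variant I mean (assuming Eigen 3; the Armadillo version is analogous):

#include <Eigen/Dense>
#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    // Explicitly zero-initialized, like the std::vector version above.
    Eigen::VectorXd a = Eigen::VectorXd::Zero(SIZE);
    Eigen::VectorXd b = Eigen::VectorXd::Zero(SIZE);
    Eigen::VectorXd c = Eigen::VectorXd::Zero(SIZE);

    double start = omp_get_wtime();
    for(int t = 0; t < TRIES; t++){
        c = a + b;   // expression-template overload; not multi-threaded for plain additions
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}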
Now, I thought std::vector had almost no overhead? What is happening here? I'd like to use the nice standard library objects.
I cannot find any reference anywhere on a simple example like this.
A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in run-time speed is slim and absent in any non-trivial program. One reason this myth persists is examples that compare raw arrays with misused std::vectors.
The time difference is 0.333 seconds, or about 0.000000000333 s per array access. Assuming a 2.33 GHz processor like mine, that is roughly 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.
The answer from Xirema has already outlined in detail the difference in the code: std::vector::resize initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initialization.
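As a minimal illustration of the difference between these allocations (just a sketch of the three variants, not part of the measured code):

#include <cstddef>
#include <vector>

int main() {
    const std::size_t size = 10000000;

    std::vector<double> v;
    v.resize(size);                       // value-initializes: every element is 0.0

    double* raw = new double[size];       // default-initialized: values are indeterminate
    double* zeroed = new double[size]();  // value-initialized: every element is 0.0

    delete[] raw;
    delete[] zeroed;
    return 0;
}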
However, your measurement doesn't include initialization, and the number of repetitions is so high that the loop cost should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time when the data is initialized?
Let's dig to the core of this with code that dynamically determines whether the memory is initialized or not (based on Xirema's, but only timing the loop itself):
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>
constexpr size_t size = 10'000'000;
auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];

    if (initialize) {
        // Touch every element; this also forces physical pages to be assigned.
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }

    // Time only the summation loop, not allocation or initialization.
    auto start = std::chrono::steady_clock::now();
    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
    auto end = std::chrono::steady_clock::now();

    delete[] a;
    delete[] b;
    delete[] c;
    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);
    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12) << std::chrono::duration_cast<std::chrono::milliseconds>(t).count() << "ms" << std::endl;
    return 0;
}
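This can be compiled like the earlier examples, for instance with

g++ -O3 -std=c++14 main.cpp

(the digit separator in the size constant requires at least C++14).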
Results are consistent:
./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms
Using the excellent Linux perf tool:
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50
pid: 12481
626ms
Performance counter stats for './a.out 50':
101.589.231 LLC-loads
105.415 dTLB-misses
0,629369979 seconds time elapsed
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
1008ms
Performance counter stats for './a.out 50 0.':
145.218.903 LLC-loads
1.889.286 dTLB-misses
1,096923077 seconds time elapsed
Linear scaling with an increasing number of repetitions also tells us that the difference comes from within the loop. But why would initializing the memory cause more last-level cache loads and data TLB misses?
To understand that, we need to understand how memory is allocated. Just because a malloc / new returns some pointer to virtual memory doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory, and the physical memory is only assigned on the first page fault. This is where page-types (from linux/tools/vm) and the pid we print as output come in handy. Looking at the page statistics during a long execution of our little benchmark, first for the run with initialization:
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000804 1 0 __R________M______________________________ referenced,mmap
0x000000000004082c 392 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 335 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 56721 221 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 1807 7 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 111 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59368 231
Most of the virtual memory is in a normal mmap,anonymous region, i.e. something that is mapped to a physical address. For the run without initialization, the picture looks very different:
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000001000000 1174 4 ________________________z_________________ zero_page
0x0000000001400000 37888 148 ______________________t_z_________________ thp,zero_page
0x0000000000000800 1 0 ___________M______________________________ mmap
0x000000000004082c 388 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 347 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 18907 73 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 633 2 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 37 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59376 231
Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 is mapped to a zero page. The data behind a and b is all backed by a single read-only 4 KiB page filled with zeros. c (and a, b in the other test) has already been written to, so it has to have its own memory.
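To see the cost of that first touch in isolation, here is a minimal sketch (separate from the benchmark above) that times a first write pass over freshly allocated memory against a second pass over the same, now physically backed, memory:

#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
    const std::size_t size = 10'000'000;
    double* a = new double[size];   // virtual memory only; no physical pages assigned yet

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; i++) {
        a[i] = 1.0;                 // first touch: page faults, physical pages get assigned
    }
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; i++) {
        a[i] += 1.0;                // second pass over memory that is now physically backed
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "first pass:  " << std::chrono::duration_cast<ms>(t1 - t0).count() << "ms\n"
              << "second pass: " << std::chrono::duration_cast<ms>(t2 - t1).count() << "ms\n"
              << a[size / 2] << "\n";   // keep a result observable so the loops are not optimized away
    delete[] a;
    return 0;
}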
Now it may look weird: everything here is zero¹, so why does it matter how it became zero? Whether you memset(0), a[i] = 0., or std::vector::resize - everything causes explicit writes to memory, and hence a page fault if you do it on a zero page. I don't think you can or should prevent physical page allocation at that point. The only thing you could do for the memset / resize is to use calloc to explicitly request zeroed memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee what kind of memory you get, but that includes the possibility of zero memory.
¹ Remember that the double 0.0 has all bits set to zero.
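For completeness, a sketch of what the calloc route would look like; whether the allocation is actually backed by the zero page is up to the C library and the kernel:

#include <cstdlib>
#include <iostream>

int main() {
    const std::size_t size = 10000000;
    // calloc guarantees zeroed memory; a large allocation is typically served by
    // fresh, already-zeroed pages rather than by writing zeros element by element.
    double* a = static_cast<double*>(std::calloc(size, sizeof(double)));
    if (a == nullptr) return 1;
    std::cout << a[0] << "\n";   // prints 0
    std::free(a);
    return 0;
}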
In the end, the performance difference really comes only from the loop, but it is caused by initialization. std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from the optimization of an abnormal case: uninitialized data.