I'm looking into OpenMP, partly because my program needs to add very large vectors (millions of elements). However, I see a fairly large difference depending on whether I use std::vector or a raw array, which I cannot explain. I insist that the difference is only in the loop, not in the initialization, of course.
The time difference I refer to comes only from timing the addition, specifically not taking into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time.
I use g++ 5.x on Ubuntu 16.04.
Edit: I tested what @Shadow said, and it got me thinking: is there something going on with optimization? If I compile with -O2, then, using initialized raw arrays, I get the for loop scaling with the number of threads again. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimizes before the pragma is seen.
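For reference, the two invocations I am comparing (same main.cpp as below, only the optimization level changes):
g++ -O2 -fopenmp -std=c++11 main.cpp
g++ -O3 -fopenmp -std=c++11 main.cpp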
I came up with the following simple test:
#include <iostream>
#include <vector>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    std::vector<double> a, b, c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++){
            #pragma omp for
            for( i = 0; i < SIZE; i++){
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}
I compile with
g++ -O3 -fopenmp -std=c++11 main.cpp
And I get for one thread:
>time ./a.out
finished in 2.5638
./a.out 2.58s user 0.04s system 99% cpu 2.619 total.
For two threads, the loop takes 1.2 s, 1.23 s total.
Now if I use raw arrays:
#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++)
        {
            #pragma omp for
            for( i = 0; i < SIZE; i++)
            {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
And I get (1 thread):
>time ./a.out
finished in 1.92901
./a.out 1.92s user 0.01s system 99% cpu 1.939 total
std::vector is 33% slower!
For two threads:
>time ./a.out
finished in 1.20061
./a.out 2.39s user 0.02s system 198% cpu 1.208 total
As a comparison, with Eigen or Armadillo for exactly the same operation (using the c = a + b overload with vector objects), I get a total real time of ~2.8 s. They are not multi-threaded for vector additions.
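For reference, a minimal sketch of the Eigen variant I mean (assuming Eigen 3; the Armadillo version is analogous):

#include <Eigen/Dense>
#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    // Explicitly zero-initialized, like the std::vector version above.
    Eigen::VectorXd a = Eigen::VectorXd::Zero(SIZE);
    Eigen::VectorXd b = Eigen::VectorXd::Zero(SIZE);
    Eigen::VectorXd c = Eigen::VectorXd::Zero(SIZE);

    double start = omp_get_wtime();
    for(int t = 0; t < TRIES; t++){
        c = a + b;   // expression-template overload; not multi-threaded for plain additions
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}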
Now, I thought std::vector had almost no overhead? What is happening here? I'd like to use the nice standard library objects.
I cannot find any reference anywhere on a simple example like this.
A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in run-time speed is slim and absent in any non-trivial program. One reason this myth persists is examples that compare raw arrays with misused std::vectors.
The time difference is 0.333 seconds, or about 0.000000000333 s per array access. Assuming a 2.33 GHz processor like mine, that is roughly 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.
The answer from Xirema has already outlined in detail the difference in the code: std::vector::resize initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initialization.
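As a minimal illustration of the difference between these allocations (just a sketch of the three variants, not part of the measured code):

#include <cstddef>
#include <vector>

int main() {
    const std::size_t size = 10000000;

    std::vector<double> v;
    v.resize(size);                       // value-initializes: every element is 0.0

    double* raw = new double[size];       // default-initialized: values are indeterminate
    double* zeroed = new double[size]();  // value-initialized: every element is 0.0

    delete[] raw;
    delete[] zeroed;
    return 0;
}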
However, your measurement doesn't include initialization, and the number of repetitions is so high that the loop cost should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time when the data is initialized?
Let's dig to the core of this with code that dynamically determines whether the memory is initialized or not (based on Xirema's, but only timing the loop itself):
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>
constexpr size_t size = 10'000'000;
auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];

    if (initialize) {
        // Touch every element; this also forces physical pages to be assigned.
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }

    // Time only the summation loop, not allocation or initialization.
    auto start = std::chrono::steady_clock::now();
    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
    auto end = std::chrono::steady_clock::now();

    delete[] a;
    delete[] b;
    delete[] c;
    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);
    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12) << std::chrono::duration_cast<std::chrono::milliseconds>(t).count() << "ms" << std::endl;
    return 0;
}
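This can be compiled like the earlier examples, for instance with

g++ -O3 -std=c++14 main.cpp

(the digit separator in the size constant requires at least C++14).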
Results are consistent:
./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms
Using the excellent Linux perf tool:
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50
pid: 12481
626ms
Performance counter stats for './a.out 50':
101.589.231 LLC-loads
105.415 dTLB-misses
0,629369979 seconds time elapsed
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
1008ms
Performance counter stats for './a.out 50 0.':
145.218.903 LLC-loads
1.889.286 dTLB-misses
1,096923077 seconds time elapsed
Linear scaling with an increasing number of repetitions also tells us that the difference comes from within the loop. But why would initializing the memory cause more last-level cache loads and data TLB misses?
To understand that, we need to understand how memory is allocated. Just because a malloc / new returns some pointer to virtual memory doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory, and the physical memory is only assigned on the first page fault. This is where page-types (from linux/tools/vm) and the pid we print as output come in handy. Looking at the page statistics during a long execution of our little benchmark, first for the run with initialization:
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000804 1 0 __R________M______________________________ referenced,mmap
0x000000000004082c 392 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 335 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 56721 221 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 1807 7 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 111 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59368 231
Most of the virtual memory is in a normal mmap,anonymous region, i.e. something that is mapped to a physical address. For the run without initialization, the picture looks very different:
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000001000000 1174 4 ________________________z_________________ zero_page
0x0000000001400000 37888 148 ______________________t_z_________________ thp,zero_page
0x0000000000000800 1 0 ___________M______________________________ mmap
0x000000000004082c 388 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 347 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 18907 73 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 633 2 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 37 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59376 231
Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 is mapped to a zero page. The data behind a and b is all backed by a single read-only 4 KiB page filled with zeros. c (and a, b in the other test) has already been written to, so it has to have its own memory.
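To see the cost of that first touch in isolation, here is a minimal sketch (separate from the benchmark above) that times a first write pass over freshly allocated memory against a second pass over the same, now physically backed, memory:

#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
    const std::size_t size = 10'000'000;
    double* a = new double[size];   // virtual memory only; no physical pages assigned yet

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; i++) {
        a[i] = 1.0;                 // first touch: page faults, physical pages get assigned
    }
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; i++) {
        a[i] += 1.0;                // second pass over memory that is now physically backed
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "first pass:  " << std::chrono::duration_cast<ms>(t1 - t0).count() << "ms\n"
              << "second pass: " << std::chrono::duration_cast<ms>(t2 - t1).count() << "ms\n"
              << a[size / 2] << "\n";   // keep a result observable so the loops are not optimized away
    delete[] a;
    return 0;
}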
Now it may look weird: everything here is zero¹, so why does it matter how it became zero? Whether you memset(0), a[i] = 0., or std::vector::resize - everything causes explicit writes to memory, and hence a page fault if you do it on a zero page. I don't think you can or should prevent physical page allocation at that point. The only thing you could do for the memset / resize is to use calloc to explicitly request zeroed memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee what kind of memory you get, but that includes the possibility of zero memory.
¹ Remember that the double 0.0 has all bits set to zero.
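For completeness, a sketch of what the calloc route would look like; whether the allocation is actually backed by the zero page is up to the C library and the kernel:

#include <cstdlib>
#include <iostream>

int main() {
    const std::size_t size = 10000000;
    // calloc guarantees zeroed memory; a large allocation is typically served by
    // fresh, already-zeroed pages rather than by writing zeros element by element.
    double* a = static_cast<double*>(std::calloc(size, sizeof(double)));
    if (a == nullptr) return 1;
    std::cout << a[0] << "\n";   // prints 0
    std::free(a);
    return 0;
}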
In the end, the performance difference really comes only from the loop, but it is caused by initialization. std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from the optimization of an abnormal case: uninitialized data.