Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is C array so much faster than std::array? [duplicate]

We are currently writing some performance critical code in C++ that operates on many large matrices and vectors. Regarding to our research, there should be no major performance difference between std::array and standard C arrays (See This question or this). However, while testing we experienced a huge performance improvement by using C arrays over std::array. This is our demo code:

#include <iostream>
#include <array>
#include <sys/time.h>

#define ROWS 784
#define COLS 100
#define RUNS 50

using std::array;

void DotPComplex(array<double, ROWS> &result, array<double, ROWS> &vec1, array<double, ROWS> &vec2){
  for(int i = 0; i < ROWS; i++){
    result[i] = vec1[i] * vec2[i];
  }
}

void DotPSimple(double result[ROWS], double vec1[ROWS], double vec2[ROWS]){
  for(int i = 0; i < ROWS; i++){
    result[i] = vec1[i] * vec2[i];
  }
}

void MatMultComplex(array<double, ROWS> &result, array<array<double, COLS>, ROWS> &mat, array<double, ROWS> &vec){
  for (int i = 0; i < COLS; ++i) {
      for (int j = 0; j < ROWS; ++j) {
        result[i] += mat[i][j] * vec[j];
      }
  }
}

void MatMultSimple(double result[ROWS], double mat[ROWS][COLS], double vec[ROWS]){
  for (int i = 0; i < COLS; ++i) {
      for (int j = 0; j < ROWS; ++j) {
        result[i] += mat[i][j] * vec[j];
      }
  }
}

double getTime(){
    struct timeval currentTime;
    gettimeofday(&currentTime, NULL);
    double tmp = (double)currentTime.tv_sec * 1000.0 + (double)currentTime.tv_usec/1000.0;
    return tmp;
}

array<double, ROWS> inputVectorComplex = {{ 0 }};
array<double, ROWS> resultVectorComplex = {{ 0 }};
double inputVectorSimple[ROWS] = { 0 };
double resultVectorSimple[ROWS] = { 0 };

array<array<double, COLS>, ROWS> inputMatrixComplex = {{0}};
double inputMatrixSimple[ROWS][COLS] = { 0 };

int main(){
  double start;
  std::cout << "DotP test with C array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    DotPSimple(resultVectorSimple, inputVectorSimple, inputVectorSimple);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "DotP test with C++ array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    DotPComplex(resultVectorComplex, inputVectorComplex, inputVectorComplex);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "MatMult test with C array : " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    MatMultSimple(resultVectorSimple, inputMatrixSimple, inputVectorSimple);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "MatMult test with C++ array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    MatMultComplex(resultVectorComplex, inputMatrixComplex, inputVectorComplex);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;
}

Compiled with: icpc demo.cpp -std=c++11 -O0 This is the result:

DotP test with C array: 
Duration: 0.289795 ms
DotP test with C++ array: 
Duration: 1.98413 ms
MatMult test with C array : 
Duration: 28.3459 ms
MatMult test with C++ array: 
Duration: 175.15 ms

With -O3 flag:

DotP test with C array: 
Duration: 0.0280762 ms
DotP test with C++ array: 
Duration: 0.0288086 ms
MatMult test with C array : 
Duration: 1.78296 ms
MatMult test with C++ array: 
Duration: 4.90991 ms

The C array implementation is much faster without compiler optimization. Why? Using compiler optimizations the dot products are equally fast. But for the matrix multiplication there is still a significant speedup when using C arrays. Is there a way to achieve equal performance when using std::array?


Update:

The compiler used: icpc 17.0.0

With gcc 4.8.5 our code runs much slower than with the intel compiler using any optimization level. Therefore we are mainly interested in the behaviour of the intel compiler.

As suggested by Jonas we adjusted RUNS 50.000 with the following results (intel compiler):

With -O0 flag:

DotP test with C array: 
Duration: 201.764 ms
DotP test with C++ array: 
Duration: 1020.67 ms
MatMult test with C array : 
Duration: 15069.2 ms
MatMult test with C++ array: 
Duration: 123826 ms

With -O3 flag:

DotP test with C array: 
Duration: 16.583 ms
DotP test with C++ array: 
Duration: 15.635 ms
MatMult test with C array : 
Duration: 980.582 ms
MatMult test with C++ array: 
Duration: 2344.46 ms
like image 485
The Floe Avatar asked Apr 03 '17 11:04

The Floe


People also ask

Is std :: array slower than C array?

It will provide a value like semantics equally to the other C++ containers. A std::array should have same runtime performance as a c-style array.

Is std :: vector slower than array?

A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in run-time speed is slim and absent in any non-trivial program. One reason for this myth to persist, are examples that compare raw arrays with mis-used std::vectors.

Is STD array slower?

std::array is designed as zero-overhead wrapper for C arrays that gives it the "normal" value like semantics of the other C++ containers. You should not notice any difference in runtime performance while you still get to enjoy the extra features.

How much slower are vectors than arrays C++?

The time difference is 0.333 seconds. Or a difference of 0.000000000333 per array access. Assuming a 2.33 GHz Processor like mine that's 0.7 instruction pipeline stages per array accesses. So the vector looks like it is using one extra instruction per accesses.


2 Answers

First up, the amount of runs you are using is simply too few. Personally, I did not realize (before running the code) that your "Duration" measurements are in miliseconds

By increasing the RUNS to 5,000,000 for DotPSimple and DotPComplex the timing are something like:

DotP test with C array:

Duration: 1074.89

DotP test with C++ array:

Duration: 1085.34

That is, they are very close to being equally fast. In fact, which ever was the fastest differed from test to test, due to the stochastic nature of a benchmark. The same was true for MatMultSimple and MatMultComplex, though they only needed 50,000 runs.

If you really want to measure and know more, you should embrace the stochastic nature of this benchmark and approximate the distributions of the "Duration" measurements. Including a random order of the functions, to remove any ordering bias.

EDIT: The assembly code (from user2079303's answer) prove outright that there are no differences with optimization enabled. Thus, the zero-cost abstractions are in fact zero-cost with optimization enabled, which is a reasonable requirement.

UPDATE:

The compiler I used:

g++ (Debian 6.3.0-6) 6.3.0 20170205

With the following command:

g++ -Wall -Wextra -pedantic -O3 test.cpp

Using this processor:

Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
like image 166
Jonas Avatar answered Sep 19 '22 20:09

Jonas


why ... is much faster without compiler optimization. Why?

For whatever reason the compiler chooses. If you don't let the compiler optimize, then you cannot expect two different pieces of code to have similar performance, even if they have identical behaviour. When optimization is enabled, the compiler may be able to transform the abstracted code into the efficient one, and the performance should be comparable.

The use of std::array involves function calls where the use of a pointer does not. For example, std::array::operator[] is a function, while the subscript operator of a pointer is not. Making a function call is potentially slower than not making a function call. All those function calls can be optimized away (expanded inline), but if you choose to not enable optimization, then the function calls remain.

But for the matrix multiplication there is still a significant speedup when using C arrays.

Probably a quirk in your benchmark, or the compiler. Here both functions have identical assembly, and so have identical performance

EDIT: I concur with Jonas' answer. The benchmark has way too few iterations. Also, it is not possible to say whether the difference between two measurements is significant without repeating the benchmark, and analysing the deviation.


The conclusios are:

  • The C array is not faster than std::array when optimization is enabled. At least not when compiling with clang 3.9.1 as demonstrated by the link. Perhaps your compiler produces different assembly, but I see no reason why it should.

  • The zero cost abstractions of C++ are zero cost only after optimization.

  • Writing meaningful micro benchmarks is not trivial.

like image 12
eerorika Avatar answered Sep 19 '22 20:09

eerorika