Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do compilers use XMM registers for raw/std arrays but not vectors?

UPDATE: Seems this is a bug for MSVC, but it also happens with ICC version 14, with highest optimizations (/o3) turned on.

UPDATE2: With optimizations turned off with ICC I got:

  • std::array 159,000
  • raw array 117,000
  • vector 162,313

I am comparing the performance of std::array vs raw array vs std::vector using the below code. I have tested using the MSVC 2012 compiler and Intel compiler vs 14, on Win 7 64, with 64-bit compiling. CPU is Intel 3rd generation.

The results are (consistently):

  • std::array ~35,600
  • raw array ~35,600
  • vector ~40,000

When I checked the assembly the compilers choose the XMM registers for the std::array and raw array and therefore presumably some sort of SIMD processing is taking place? However, for the std::vector the regular r8-r15 registers are used.

Assuming I am correct with the above, why aren't the XMM registers used for an std::vector?

Here is the fully-working test code (you will need to increase your default stack reserve size):

#include <iostream>
#include <vector>
#include <array>


const unsigned int noElements = 10000000;
const unsigned int noIterations = 500;

void testVector(){
    volatile unsigned long long sum = 0;
    unsigned long long start = 0;
    unsigned long long finish = 0;
    unsigned int x;
    unsigned int y;

    std::vector<unsigned int> vec;
    vec.resize(noElements);

    start = __rdtscp(&x);
    for(int i=0; i<noIterations; i++){

        for(int i=0; i<noElements; i++){
            vec[i] = i;
        }

        for(int i=0; i<noElements; i++){
            sum += (3 * vec[i]);
        }
    }
    finish = __rdtscp(&y);

    std::cout << "std::vector:\t" << (finish - start)/1000000 << std::endl;
}


void testRawArray(){
    volatile unsigned long long sum = 0;
    unsigned long long start = 0;
    unsigned long long finish = 0;
    unsigned int x;
    unsigned int y;

    unsigned int myRawArray[noElements];

    start = __rdtscp(&x);
    for(int i=0; i<noIterations; i++){

        for(int i=0; i<noElements; i++){
            myRawArray[i] = i;
        }

        for(int i=0; i<noElements; i++){
            sum += (3 * myRawArray[i]);
        }
    }
    finish = __rdtscp(&y);

    std::cout << "raw array: \t" << (finish - start)/1000000 << std::endl;
}

void testStdArray(){
    volatile unsigned long long sum = 0;
    unsigned long long start = 0;
    unsigned long long finish = 0;
    unsigned int x;
    unsigned int y;

    std::array<unsigned int, noElements> myStdArray;

    start = __rdtscp(&x);
    for(int i=0; i<noIterations; i++){

        for(int i=0; i<noElements; i++){
            myStdArray[i] = i;
        }

        for(int i=0; i<noElements; i++){
            sum += (3 * myStdArray[i]);
        }
    }
    finish = __rdtscp(&y);

    std::cout << "std::array: \t" << (finish - start)/1000000 << std::endl;
}


int main(){
    testStdArray();
    testRawArray();
    testVector();
}
like image 822
user997112 Avatar asked Jun 01 '14 17:06

user997112


1 Answers

Here are the results on my computer, with gcc 4.9 and Intel C++ compiler 14. I have changed the code to noElements = 1000000 and noIterations = 1000. Moreover, I have used std::chrono::steady_clock in order to time the loops.

fayard@speed:Desktop$ uname -a
Darwin speed.home 13.2.0 Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64 x86_64

fayard@speed:Desktop$ g++-4.9 --version
g++-4.9 (Homebrew gcc49 4.9.0 --enable-all-languages) 4.9.0
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
fayard@speed:Desktop$ g++-4.9 -std=c++11 -Ofast main-array.cpp -o main
fayard@speed:Desktop$ ./main
std::array:  1891738
raw array:   1889974
std::vector: 1891721

fayard@speed:Desktop$ icpc --version
icpc (ICC) 14.0.3 20140415
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
fayard@speed:Desktop$ icpc -std=c++11 -Ofast main-array.cpp -o main
fayard@speed:Desktop$ ./main
std::array:  1896141
raw array:   1886859
std::vector: 2135880

As you can see, there is no difference for gcc 4.9. For the Intel compiler std::vector is slower that the other ones. If you check the assembly code, you'll find that the loop

for(int i=0; i<noElements; i++){
    vec[i] = i;
}

is vectorized for the C-array, the std::array but not for the std::vector. If you ask the Intel Compiler why, you'll just get

main-array.cpp(23): (col. 9) remark: loop was not vectorized: existence of vector dependence

Of course, there is no dependence in this loop, but the compiler cannot figure it out. Vectorization is a pain with the standard library where all those containers need methods to access the elements and those methods hide some pointer arithmetic. It makes optimisation a nightmare for people who write optimizing compilers. Therefore, what you'll see here depends highly on the compiler you use. Most likely, you'll find changes from version to version.

What you should expect from a "perfect" compiler, is almost the same timing, as gcc does. You should not find any difference in between C-arrays and std::array (they are both allocated on the stack), and you should not see any difference in between C-arrays and std::vector when you deal with huge arrays as the allocation time is not where the CPU takes time. If on the other hand, you compare a huge number of allocation and deallocation of small arrays (for instance some std::array of size 3), the std::array or C-array will blow the std::vector out of the water, as the allocation time (on the heap for the std::vector) will be important.

So the lessons to learn are :

  • In C++11, forget about C-arrays (*)

  • Use std::array for small array (whose size is known at compile time) and std::vector for big ones

  • If you need a huge stack, most likely you are doing something stupid

  • Fortran rocks because it does not have all these problems. It has one type of array thtat can be allocated on the stack, the heap, that can do (or not) bound checking and for which vectorizers do work because they don't have to deal with crazy pointers.

Now, to come back to your question : Why do compilers use XMM registers for raw/std arrays but not vectors?

It is not always the case as you can see with gcc. The reason is that writing an optimizing compiler for C++ is a huge pain, and a lot of 'basic' optimization are still not done in many compilers.

(*) : There are still corner cases where they are useful.

like image 62
InsideLoop Avatar answered Oct 06 '22 00:10

InsideLoop