Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Measuring performance of vector<unique_ptr> on VS2013?

TL;DR Is the optimizer of VS2013 confused or are my measurements wrong or does the global dummy in fact need to be volatile to make the test valid or ____ ?

Disclaimer: This is mostly out of "academic" interest, I would not expect the differences I see to really affect any production code.


Intro: Some recent measurements of mine then led me to this question because I saw significant differences between std::vector<std::unique_ptr<T> > and boost::ptr_vector on VS2013. (also see comments there)

It would appear that for my specific test case, accessing elements in a boost::ptr_vector can be 50% faster than using a vector of unique_ptr!

My Test code is here: http://coliru.stacked-crooked.com/a/27dc2f1b91380cca (I'll refrain from also including it in this question, I'll include snippets below)

  • gcc 4.8 doesn't report any differences, so this is a VS2013 thing.

    Start...
    The timings are as follows for accessing all (1000000) elements 200 times:
    * St6vectorISt10unique_ptrIjSt14default_deleteIjEESaIS3_EE: 1764 ms
    * N5boost10ptr_vectorIjNS_20heap_clone_allocatorESaIPvEEE: 1781 ms
    Dummy output: 500000
    
  • My Timings for exactly the test code linked to are:

    Start...
    The timings are as follows for accessing all (1.000.000) elements 200 times:
    * class std::vector<....>: 344 ms
    * class boost::ptr_vector<unsigned int,....>: 216 ms
    Dummy output: 500.000
    

The test loop looks like this, I'll also keep the lengthy comment there which explains what I see:

template<typename C>
void RunContainerAccess(C& c) {
    for (size_t i = 0; i != loop_count; ++i) {
        for (auto& e : c) {
            // This is relevant: 
            // If the if-condition is present, VC++2013 will show 
            // approx. the same runtime for both cases. However,
            // if the only line in this loop is assigning the element
            // to the pGlobalDummy, *then* ptr_vector will run approx. 50%
            // faster than the unique_vector version!
            //
            // g++-4.8 does not show this behaviour
            //
            // Note: VS2013 commmand line: (release; /O2; no whole prg opt)
            //   /GS /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\simple.pch" 
            //
            // Note: http://coliru.stacked-crooked.com/ command line:
            //   g++-4.8 -std=c++11 -O2 -Wall -pedantic -pthread main.cpp && ./a.out

            // if (pGlobalDummy)
                pGlobalDummy = PtrShim(e);
        }
    }
}

If the only line in the loop is accessing the element (putting the ptr into a global dummy), then it would appear that the VS2013 optimizer does something weird. When the if (pGlobalDummy) is present, both cases are the same.

Can anyone share some light on this?

Thanks to Howard's answer I did find that adding volatile to the global dummy make a difference, i.e. when the global dumy is volatile like so:

extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;

The loops run a bit slower, but run exactly the same. Should volatile make a difference here? That is, is the test even valid without volatile?

like image 679
Martin Ba Avatar asked Jan 16 '14 20:01

Martin Ba


1 Answers

I found a bug in your test, which gives permission for optimizers to optimize in different and unpredictable ways. I don't know for sure that this is impacting your results. But it sure impacted mine.

I'm using tip-of-trunk clang + libc++ -O3.

When I run your code unmodified I get:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 0 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 0 ms
Dummy output: 500,000

I changed the output units to nanoseconds and got:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 32 ns
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 32 ns
Dummy output: 500,000

Suspicious, I inserted volatile here:

extern MyType* <ins>volatile</ins> pGlobalDummy;
MyType* <ins>volatile</ins> pGlobalDummy = nullptr;

but no change.

Then I noted that time[2] isn't being initialized, so I:

chron::nanoseconds time[2]<ins> = {}</ins>;

That did it. Now setting units back to milliseconds I get:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 394 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 406 ms
Dummy output: 500,000

So I'm curious, if you explicitly zero your time[2], you may need to:

chron::nanoseconds time[2] = {chron::nanoseconds(0), chron::nanoseconds(0)};

does this impact the results you are seeing?

Clarification

The std::chrono::duration default constructor is specified as:

constexpr duration() = default;

This will default-initialize the duration's rep if the client does not specify list-initialization, e.g.:

chrono::nanoseconds ns;  // default-initialized

When rep is an arithmetic type, no initialization is performed ([dcl.init]/p7/b3).

If the client list-initializes, e.g.:

chrono::nanoseconds ns{};  // list-initialized

Then rep is value-initialized ([dcl.init.list]/p3/b7), and for arithmetic types, value-initialization is the same thing as zero-initialization ([dcl.init]/p8/b4).

Full working example:

#include <iostream>
#include <chrono>

int
main()
{
    std::chrono::nanoseconds n1;
    std::chrono::nanoseconds n2{};
    std::chrono::nanoseconds n3 = {};
    std::cout << "n1 = " << n1.count() << "ns\n";
    std::cout << "n2 = " << n2.count() << "ns\n";
    std::cout << "n3 = " << n3.count() << "ns\n";
}

For me, when compiled with -O0 I get:

n1 = 0ns
n2 = 0ns
n3 = 0ns

But compiling the same thing with -O3, this changes to:

n1 = 32ns
n2 = 0ns
n3 = 0ns
like image 187
Howard Hinnant Avatar answered Oct 14 '22 00:10

Howard Hinnant