TL;DR: Is the VS2013 optimizer confused, are my measurements wrong, does the global dummy in fact need to be volatile for the test to be valid, or ____?
Disclaimer: This is mostly out of "academic" interest, I would not expect the differences I see to really affect any production code.
Intro: Some recent measurements of mine led me to this question, because I saw significant differences between std::vector<std::unique_ptr<T> > and boost::ptr_vector on VS2013 (also see the comments there).
It would appear that for my specific test case, accessing elements in a boost::ptr_vector can be 50% faster than using a vector of unique_ptr!
My test code is here: http://coliru.stacked-crooked.com/a/27dc2f1b91380cca (I'll refrain from including all of it in this question; relevant snippets are below.)
gcc 4.8 doesn't report any differences, so this is a VS2013 thing.
Start...
The timings are as follows for accessing all (1000000) elements 200 times:
* St6vectorISt10unique_ptrIjSt14default_deleteIjEESaIS3_EE: 1764 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorESaIPvEEE: 1781 ms
Dummy output: 500000
My timings for exactly the test code linked to are:
Start...
The timings are as follows for accessing all (1.000.000) elements 200 times:
* class std::vector<....>: 344 ms
* class boost::ptr_vector<unsigned int,....>: 216 ms
Dummy output: 500.000
The test loop looks like this; I'll also keep the lengthy comment, which explains what I see:
template<typename C>
void RunContainerAccess(C& c) {
    for (size_t i = 0; i != loop_count; ++i) {
        for (auto& e : c) {
            // This is relevant:
            // If the if-condition is present, VC++2013 will show
            // approx. the same runtime for both cases. However,
            // if the only line in this loop is assigning the element
            // to pGlobalDummy, *then* ptr_vector will run approx. 50%
            // faster than the unique_ptr version!
            //
            // g++-4.8 does not show this behaviour
            //
            // Note: VS2013 command line: (release; /O2; no whole prg opt)
            // /GS /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\simple.pch"
            //
            // Note: http://coliru.stacked-crooked.com/ command line:
            // g++-4.8 -std=c++11 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
            // if (pGlobalDummy)
            pGlobalDummy = PtrShim(e);
        }
    }
}
If the only line in the loop is accessing the element (putting the pointer into a global dummy), then the VS2013 optimizer appears to do something weird. When the if (pGlobalDummy) is present, both cases perform the same.
Can anyone shed some light on this?
Thanks to Howard's answer I did find that adding volatile to the global dummy makes a difference, i.e. when the global dummy is volatile like so:
extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;
The loops run a bit slower, but both run in exactly the same time. Should volatile make a difference here? That is, is the test even valid without volatile?
I found a bug in your test which gives optimizers permission to optimize in different and unpredictable ways. I don't know for sure that this is impacting your results, but it sure impacted mine.
I'm using tip-of-trunk clang + libc++ -O3.
When I run your code unmodified I get:
Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 0 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 0 ms
Dummy output: 500,000
I changed the output units to nanoseconds and got:
Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 32 ns
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 32 ns
Dummy output: 500,000
Suspicious, I inserted volatile here:
extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;
but no change.
Then I noted that time[2] isn't being initialized, so I changed it to:
chron::nanoseconds time[2] = {};
That did it. Now setting units back to milliseconds I get:
Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 394 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 406 ms
Dummy output: 500,000
So I'm curious: if you explicitly zero your time[2] (you may need to write:
chron::nanoseconds time[2] = {chron::nanoseconds(0), chron::nanoseconds(0)};
), does this impact the results you are seeing?
Clarification
The std::chrono::duration default constructor is specified as:
constexpr duration() = default;
This will default-initialize the duration's rep if the client does not specify list-initialization, e.g.:
chrono::nanoseconds ns; // default-initialized
When rep is an arithmetic type, no initialization is performed ([dcl.init]/p7/b3).
If the client list-initializes, e.g.:
chrono::nanoseconds ns{}; // list-initialized
Then rep is value-initialized ([dcl.init.list]/p3/b7), and for arithmetic types, value-initialization is the same thing as zero-initialization ([dcl.init]/p8/b4).
Full working example:
#include <iostream>
#include <chrono>
int
main()
{
std::chrono::nanoseconds n1;
std::chrono::nanoseconds n2{};
std::chrono::nanoseconds n3 = {};
std::cout << "n1 = " << n1.count() << "ns\n";
std::cout << "n2 = " << n2.count() << "ns\n";
std::cout << "n3 = " << n3.count() << "ns\n";
}
For me, when compiled with -O0 I get:
n1 = 0ns
n2 = 0ns
n3 = 0ns
But compiling the same thing with -O3, this changes to:
n1 = 32ns
n2 = 0ns
n3 = 0ns