While trying to estimate the performance difference between push_back and std::inserter, I ran into a very strange performance issue.
Consider the following code:
#include <iterator>  // std::insert_iterator
#include <vector>

using container = std::vector<int>;

const int size = 1000000;
const int count = 1000;

#ifdef MYOWNFLAG
// Never called; only compiled in when MYOWNFLAG is defined.
void foo(std::insert_iterator<container> ist)
{
    for (int i = 0; i < size; ++i)
        *ist++ = i;
}
#endif

void bar(container& cnt)
{
    for (int i = 0; i < size; ++i)
        cnt.push_back(i);
}

int main()
{
    container cnt;
    for (int i = 0; i < count; ++i)
    {
        cnt.clear();  // keeps the capacity, so later rounds do not reallocate
        bar(cnt);
    }
    return 0;
}
In this case, no matter whether MYOWNFLAG is defined or not, the function foo is never called. However, the value of this flag has an impact on performance:
$ g++ -g -pipe -march=native -pedantic -std=c++11 -W -Wall -Wextra -Werror -O3 -o bin/inserter src/inserter.cc && time ./bin/inserter
./bin/inserter 4,73s user 0,00s system 100% cpu 4,728 total
$ g++ -g -pipe -march=native -pedantic -std=c++11 -W -Wall -Wextra -Werror -O3 -o bin/inserter src/inserter.cc -DMYOWNFLAG && time ./bin/inserter
./bin/inserter 2,09s user 0,00s system 99% cpu 2,094 total
Note that if I change the prototype of foo to use std::back_insert_iterator, I get performance similar to the build without the flag.
What's going on with the compiler's optimisations?
I use gcc 4.9.2 20150304 (prerelease)
First I will show you a magic trick to achieve this without the garbage function. Then I will show you why the garbage function works. So, the trick:
Original, slow version (note that my machine is about twice as fast):
g++ -g -pipe -march=native -pedantic -std=c++11 -W -Wall -Wextra -Werror -O3 -o bin/inserter src/inserter.cc && time ./bin/inserter
real 0m2.197s
user 0m2.200s
sys 0m0.000s
Now the trick (your define is still inactive):
g++ -g -pipe -march=native -pedantic -std=c++11 -W -Wall -Wextra -Werror -O3 -o bin/inserter src/inserter.cc --param inline-min-speedup=2 && time ./bin/inserter
real 0m1.114s
user 0m1.100s
sys 0m0.010s
Note: the only difference is the strange-looking argument --param inline-min-speedup=2.
Now I will briefly outline the investigation:
1. What is the difference between the fast and the slow version? In the slow version there is an out-of-line call to emplace_back_aux inside bar(), which is magically inlined when your foo is compiled in. So we may conclude that bar is very hot and inlining is crucial here. Most probably this whole issue is about inlining.
2. Now, with the option -fdump-ipa-inline-details, let's look at the inlining dumps. You will see different time/size considerations. They are hard to read and I don't want to paste all the details here. But the general result of studying this information is: GCC thinks that the growth in module size (in percent) is not worth the estimated speedup.
3. What to do? Two possibilities:
3.1. Either increase the module size and the overall speedup estimate with unused foo code that uses the right types, like insert_iterator, to call emplace_back, so that the ratio gets bigger and hits the inlining limit (note that this way is very unstable: everything may fall apart in other compiler versions with improved inlining algorithms, and you also need to be really lucky to guess code that works).
3.2. Or move the inlining limit. What I said to GCC with the parameter above is: "please consider even big functions with a smaller speedup for inlining".
That's it. There are a lot of other parameters inside GCC and other tricks that you can do with them.