I'm compiling a C++ library that defines a single function which randomly samples from a set of data points. The data points are stored in a std::vector<double>. There are 126,272 std::vector::push_back statements, and the file is taking a very long time to compile.
Why would this take so long? (All the code other than the push_back statements compiles in under a second, because there is very little of it.)
There is the -ftime-report option in GCC, which prints a detailed report of the time spent in each compiler phase.
I used Ubuntu 12.04 64-bit with gcc 4.6.3 and this code to reproduce your situation:
#include <vector>
using namespace std;
int main()
{
    vector<double> d;
    d.push_back(5.7862517058766);
    /* ... N lines generated with
       perl -e 'print(" d.push_back(",rand(10),");\n") for 1..100000'
    */
    d.push_back(3.77195464257674);
    return d.size();
}
Here are the -ftime-report outputs for various N (wall time was inaccurate because of background load on the PC, so look at the user time, usr):
N=10000
$ g++ -ftime-report ./pb10k.cpp
Execution times (seconds)
...
expand vars : 1.48 (47%) usr 0.01 ( 7%) sys 1.49 (44%) wall 1542 kB ( 2%) ggc
expand : 0.11 ( 3%) usr 0.01 ( 7%) sys 0.10 ( 3%) wall 19187 kB (30%) ggc
...
TOTAL : 3.18 0.15 3.35 64458 kB
N=100000
$ g++ -ftime-report ./pb100k.cpp
Execution times (seconds)
....
preprocessing : 0.49 ( 0%) usr 0.28 ( 5%) sys 0.59 ( 0%) wall 6409 kB ( 1%) ggc
parser : 0.96 ( 0%) usr 0.39 ( 6%) sys 1.41 ( 0%) wall 108217 kB (18%) ggc
name lookup : 0.06 ( 0%) usr 0.07 ( 1%) sys 0.24 ( 0%) wall 1023 kB ( 0%) ggc
inline heuristics : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 0 kB ( 0%) ggc
integration : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 4095 kB ( 1%) ggc
tree gimplify : 0.22 ( 0%) usr 0.00 ( 0%) sys 0.23 ( 0%) wall 36068 kB ( 6%) ggc
tree eh : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 0%) wall 5678 kB ( 1%) ggc
tree CFG construction : 0.08 ( 0%) usr 0.01 ( 0%) sys 0.10 ( 0%) wall 38544 kB ( 7%) ggc
....
expand vars : 715.98 (97%) usr 1.62 (27%) sys 718.32 (83%) wall 18359 kB ( 3%) ggc
expand : 1.04 ( 0%) usr 0.09 ( 1%) sys 1.64 ( 0%) wall 190836 kB (33%) ggc
post expand cleanups : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.15 ( 0%) wall 43 kB ( 0%) ggc
....
rest of compilation : 1.94 ( 0%) usr 2.56 (43%) sys 102.42 (12%) wall 63620 kB (11%) ggc
TOTAL : 739.68 6.01 866.46 586293 kB
So, there is some extra work for huge N in the "expand vars" phase. This phase is exactly at cfgexpand.c:4463 (the code bracketed by the TV_VAR_EXPAND timer macro).
An interesting fact: my custom-compiled 32-bit g++ 4.6.2 has very short compilation times (~20 sec for N = 100000).
What is the difference between my g++ and Ubuntu's g++? Ubuntu's turns on GCC stack protection (the -fstack-protector option) by default, and this protection is added exactly in the "expand vars" phase (found in the sources at cfgexpand.c:1644, expand_used_vars(); mentioned here):
N=100000, stack protector disabled with the -fno-stack-protector option (use it for your code):
$ g++ -ftime-report -fno-stack-protector pb100k.cpp 2>&1 |egrep 'TOTAL|expand vars'
expand vars : 0.08 ( 0%) usr 0.01 ( 1%) sys 0.09 ( 0%) wall 18359 kB ( 3%) ggc
TOTAL : 23.05 1.48 24.60 586293 kB
Compilation time is 24 seconds, down from roughly 800.
UPDATE:
After running gcc inside callgrind (the call-graph profiling tool from Valgrind), I can say that there are N stack variables. If the stack protector is enabled, they are handled in the "expand vars" phase by three O(N^2) algorithms. In fact there are N^2 successful conflict detections and 1.5 * N^2 bit manipulations, plus some nested-loop logic.
Why is the number of stack variables so high? Because every double constant in your code is saved to a different slot on the stack, then loaded from its slot and passed as the calling convention dictates (via the top of the stack on x86; via registers on x86_64). The funny fact: the code generated for all the push_backs is identical with -fstack-protector and with -fno-stack-protector, and the stack layout of the constants is the same too. Only some stack-layout offsets of non-push_back code are affected (checked by comparing two runs with -S and diff -u). No additional code was produced by the enabled stack protector.
Enabling the stack protector fatally changes some behaviour inside the compiler. I can't say where exactly (note: it should be possible to find this turning point by comparing stack traces, using the callgraph.tar.gz by Juan M. Bello Rivas).
The first big N*(N+1)/2 = O(N^2) walk is in the expand_used_vars_for_block (tree block, level) function, which records conflicts between pairs of stack variables:
/* Since we do not track exact variable lifetimes (which is not even
possible for variables whose address escapes), we mirror the block
tree in the interference graph. Here we cause all variables at this
level, and all sublevels, to conflict. */
if (old_sv_num < this_sv_num)
{
new_sv_num = stack_vars_num;
for (i = old_sv_num; i < new_sv_num; ++i)
for (j = i < this_sv_num ? i : this_sv_num; j-- > old_sv_num ;)
add_stack_var_conflict (i, j);
}
}
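To see why this walk is quadratic, here is a stand-alone sketch (our code, not GCC's) that mimics the nested loop above for a single block containing n variables (old_sv_num = 0, this_sv_num = new_sv_num = n) and counts how many times add_stack_var_conflict would be called:

```cpp
#include <cstdint>

// Mimics the conflict-marking loop from expand_used_vars_for_block for one
// block of n variables. Returns the number of add_stack_var_conflict calls.
std::uint64_t count_conflict_calls(std::uint64_t n)
{
    const std::uint64_t old_sv_num = 0, this_sv_num = n, new_sv_num = n;
    std::uint64_t calls = 0;
    for (std::uint64_t i = old_sv_num; i < new_sv_num; ++i)
        for (std::uint64_t j = (i < this_sv_num ? i : this_sv_num);
             j-- > old_sv_num; )
            ++calls;   // stands in for add_stack_var_conflict (i, j)
    return calls;
}
```

That is n*(n-1)/2 calls for a single block; for N = 100,000 variables in one block it already comes to about 5 * 10^9 calls, matching the quadratic blow-up seen in the timings.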
Each add_stack_var_conflict(i, j) call comes down to allocating a conflict bitmap for a variable (if needed) and setting one bit in it.
There is a second N^2 walk in add_alias_set_conflicts. It does type checks for every pair with objects_must_conflict_p, checking whether the two variables are of the same type (most pairs are; this is type-based alias analysis, TBAA). If not, add_stack_var_conflict is called; there are only N such calls from this N^2 loop nest.
The last huge walk is in the partition_stack_vars() function, with a qsort of the stack vars (O(N log N)) and an N*(N-1)/2 = O(N^2) walk to find all non-conflicting pairs. Here is the pseudocode of partition_stack_vars from the cfgexpand.c file:
Sort the objects by size.
For each object A {
S = size(A)
O = 0
loop {
Look for the largest non-conflicting object B with size <= S.
/* There is a call to stack_var_conflict_p to check for
* conflict between 2 vars */
UNION (A, B)
offset(B) = O
O += size(B)
S -= size(B)
}
}
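Here is a self-contained C++ sketch of that pseudocode (simplified; the names and structure are ours, not GCC's), with a pluggable conflict predicate. It makes the pathology visible: when every pair conflicts, the greedy packing performs all N*(N-1)/2 checks and never executes a UNION.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct StackVar {
    std::size_t size;
    std::size_t offset = 0;
    int partition = -1;          // -1 means not yet assigned
};

// Simplified partition_stack_vars: sort by size (descending), then for each
// unassigned leader greedily pack the largest non-conflicting followers into
// its partition. Returns the number of conflict checks performed.
std::size_t partition_stack_vars(
    std::vector<StackVar>& vars,
    const std::function<bool(std::size_t, std::size_t)>& conflict_p)
{
    std::vector<std::size_t> order(vars.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),            // the qsort step, O(N log N)
              [&](std::size_t a, std::size_t b) { return vars[a].size > vars[b].size; });

    std::size_t checks = 0;
    int part = 0;
    for (std::size_t ai = 0; ai < order.size(); ++ai) {
        std::size_t a = order[ai];
        if (vars[a].partition != -1) continue;       // already packed elsewhere
        vars[a].partition = part;
        std::size_t budget = vars[a].size, off = 0;  // S and O in the pseudocode
        for (std::size_t bi = ai + 1; bi < order.size(); ++bi) {
            std::size_t b = order[bi];
            if (vars[b].partition != -1 || vars[b].size > budget) continue;
            ++checks;                                // stack_var_conflict_p (a, b)
            if (conflict_p(a, b)) continue;          // the always-true case here
            vars[b].partition = part;                // UNION (A, B)
            vars[b].offset = off;
            off += vars[b].size;
            budget -= vars[b].size;
        }
        ++part;
    }
    return checks;
}
```

With 10 equal-size variables and an always-conflicting predicate, the sketch performs 10*9/2 = 45 checks and every variable ends up alone in its own partition.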
The stack_var_conflict_p function just checks whether the i-th variable's conflict bitmask has the bit for the j-th variable set (via a call to bitmap_bit_p(i->conflict_mask, j)). The really bad news: callgrind shows that every conflict check succeeds, so the UNION logic is skipped for every pair.
So, a lot of time is wasted on O(N^2) bit sets and O(N^2/2) bit checks, and none of this work helps optimize anything. And yes, all of this happens at -O0 and is triggered by -fstack-protector.
UPDATE2:
It seems the turning point is in expand_one_var (cfgexpand.c, GCC 4.6), in the check for immediate or deferred allocation of a variable on the stack:
1110 else if (defer_stack_allocation (var, toplevel))
1111 add_stack_var (origvar);
1112 else
1113 {
1114 if (really_expand)
1115 expand_one_stack_var (origvar);
1116 return tree_low_cst (DECL_SIZE_UNIT (var), 1);
1117 }
(expand_one_stack_var was called here only in fast variant, according to callgrind)
Deferred allocation is forced when -fstack-protector is enabled (sometimes it needs to reorder all stack variables). There is even a comment about a "quadratic problem", which looks all too familiar to us now:
969 /* A subroutine of expand_one_var. VAR is a variable that will be
970 allocated to the local stack frame. Return true if we wish to
971 add VAR to STACK_VARS so that it will be coalesced with other
972 variables. Return false to allocate VAR immediately.
973
974 This function is used to reduce the number of variables considered
975 for coalescing, which reduces the size of the quadratic problem. */
976
977 static bool
978 defer_stack_allocation (tree var, bool toplevel)
979 {
980 /* If stack protection is enabled, *all* stack variables must be deferred,
981 so that we can re-order the strings to the top of the frame. */
982 if (flag_stack_protect)
983 return true;
(stack allocation is also deferred at -O2 and greater)
Here is the commit that added this logic: http://gcc.gnu.org/ml/gcc-patches/2005-05/txt00029.txt
This question was completely answered by the great answer from osgx.
Maybe one additional aspect: push_back() vs. an initialization list.
When running the above test with 100,000 push_backs, I get the following result with gcc 4.4.6 on a Debian 6.0.6 system:
$ time g++ -std=c++0x -ftime-report ./pb100k.cc
Execution times (seconds)
garbage collection : 0.55 ( 1%) usr 0.00 ( 0%) sys 0.55 ( 1%) wall 0 kB ( 0%) ggc
...
reload : 33.95 (58%) usr 0.13 ( 6%) sys 34.14 (56%) wall 65723 kB ( 9%) ggc
thread pro- & epilogue: 0.66 ( 1%) usr 0.00 ( 0%) sys 0.66 ( 1%) wall 84 kB ( 0%) ggc
final : 1.82 ( 3%) usr 0.01 ( 0%) sys 1.81 ( 3%) wall 21 kB ( 0%) ggc
TOTAL : 58.65 2.13 60.92 737584 kB
real 1m2.804s
user 1m0.348s
sys 0m2.328s
When using an initialization list, it is much faster:
$ cat pbi100k.cc
#include <vector>
using namespace std;
int main()
{
vector<double> d {
0.190987822870774,
/* 100000 lines with doubles generated with:
perl -e 'print(rand(10),",\n") for 1..100000'
*/
7.45608614801021};
return d.size();
}
$ time g++ -std=c++0x -ftime-report ./pbi100k.cc
Execution times (seconds)
callgraph construction: 0.02 ( 2%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 25 kB ( 0%) ggc
preprocessing : 0.72 (59%) usr 0.06 (25%) sys 0.80 (54%) wall 8004 kB (12%) ggc
parser : 0.24 (20%) usr 0.12 (50%) sys 0.36 (24%) wall 43185 kB (65%) ggc
name lookup : 0.01 ( 1%) usr 0.05 (21%) sys 0.03 ( 2%) wall 1447 kB ( 2%) ggc
tree gimplify : 0.01 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 277 kB ( 0%) ggc
tree find ref. vars : 0.01 ( 1%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 15 kB ( 0%) ggc
varconst : 0.19 (15%) usr 0.01 ( 4%) sys 0.20 (14%) wall 11288 kB (17%) ggc
integrated RA : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 74 kB ( 0%) ggc
reload : 0.01 ( 1%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 61 kB ( 0%) ggc
TOTAL : 1.23 0.24 1.48 66378 kB
real 0m1.701s
user 0m1.416s
sys 0m0.276s
This is about 30+ times faster!
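A sketch of why this helps, following osgx's analysis: with an initializer list, the 100,000 doubles become one constant array in the binary (the backing store that the std::initializer_list points into), so the compiler tracks a single object instead of 100,000 stack slots. Both forms produce the same vector at run time; the tiny stand-in below uses 3 values instead of 100,000:

```cpp
#include <vector>

// Per-element push_back: each literal becomes its own stack temporary
// for the compiler to track during expansion.
std::vector<double> via_push_back()
{
    std::vector<double> d;
    d.push_back(0.190987822870774);
    d.push_back(5.7862517058766);
    d.push_back(7.45608614801021);
    return d;
}

// Initializer list: the literals are materialized as one const double[3]
// that the initializer_list references; the vector copies from that array.
std::vector<double> via_init_list()
{
    return std::vector<double>{0.190987822870774,
                               5.7862517058766,
                               7.45608614801021};
}
```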