I claimed to a coworker that if (i < input.size() - 1) print(0);
would get optimized in this loop so that input.size()
is not read in every iteration, but it turns out that this is not the case!
void print(int x) {
std::cout << x << std::endl;
}
void print_list(const std::vector<int>& input) {
int i = 0;
for (size_t i = 0; i < input.size(); i++) {
print(input[i]);
if (i < input.size() - 1) print(0);
}
}
According to the Compiler Explorer with gcc options -O3 -fno-exceptions
we are actually reading input.size()
each iteration and using lea
to perform a subtraction!
movq 0(%rbp), %rdx
movq 8(%rbp), %rax
subq %rdx, %rax
sarq $2, %rax
leaq -1(%rax), %rcx
cmpq %rbx, %rcx
ja .L35
addq $1, %rbx
Interestingly, in Rust this optimization does occur. It looks like i
gets replaced with a variable j
that is decremented each iteration, and the test i < input.size() - 1
is replaced with something like j > 0
.
fn print(x: i32) {
println!("{}", x);
}
pub fn print_list(xs: &Vec<i32>) {
for (i, x) in xs.iter().enumerate() {
print(*x);
if i < xs.len() - 1 {
print(0);
}
}
}
In the Compiler Explorer the relevant assembly looks like this:
cmpq %r12, %rbx
jae .LBB0_4
I checked and I am pretty sure r12
is xs.len() - 1
and rbx
is the counter. Earlier there is an add
for rbx
and a mov
outside of the loop into r12
.
Why is this? It seems like if GCC is able to inline the size()
and operator[]
as it did, it should be able to know that size()
does not change. But maybe GCC's optimizer judges that it is not worth pulling it out into a variable? Or maybe there is some other possible side effect that would make this unsafe--does anyone know?
The non-inline function call to cout.operator<<(int)
is a black box for the optimizer (because the library is just written in C++ and all the optimizer sees is a prototype; see discussion in comments). It has to assume any memory that could possibly be pointed to by a global var has been modified.
(Or the std::endl
call. BTW, why force a flush of cout at that point instead of just printing a '\n'
?)
e.g. for all it knows, std::vector<int> &input
is a reference to a global variable, and one of those function calls modifies that global var. (Or there's a global vector<int> *ptr
somewhere, or there's a function that returns a pointer to a static vector<int>
in some other compilation unit, or some other way that a function could get a reference to this vector without being passed a reference to it by us.
If you had a local variable whose address had never been taken, the compiler could assume that non-inline function calls couldn't mutate it. Because there'd be no way for any global variable to hold a pointer to this object. (This is called Escape Analysis). That's why the compiler can keep size_t i
in a register across function calls. (int i
can just get optimized away because it's shadowed by size_t i
and not used otherwise).
It could do the same with a local vector
(i.e. for the base, end_size and end_capacity pointers.)
ISO C99 has a solution for this problem: int *restrict foo
. Many C++ compiles support int *__restrict foo
to promise that memory pointed to by foo
is only accessed via that pointer. Most commonly useful in functions that take 2 arrays, and you want to promise the compiler they don't overlap. So it can auto-vectorize without generating code to check for that and run a fallback loop.
The OP comments:
In Rust a non-mutable reference is a global guarantee that no one else is mutating the value you have a reference to (equivalent to C++
restrict
)
That explains why Rust can make this optimization but C++ can't.
Obviously you should use auto size = input.size();
once at the top of your function so the compiler knows it's a loop invariant. C++ implementations don't solve this problem for you, so you have to do it yourself.
You might also need const int *data = input.data();
to hoist loads of the data pointer from the std::vector<int>
"control block" as well. It's unfortunate that optimizing can require very non-idiomatic source changes.
Rust is a much more modern language, designed after compiler developers learned what was possible in practice for compilers. It really shows in other ways, too, including portably exposing some of the cool stuff CPUs can do via i32.count_ones
, rotate, bit-scan, etc. It's really dumb that ISO C++ still doesn't expose any of these portably, except std::bitset::count()
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With