I'm trying to create a local array of some POD values (e.g. double) with a fixed max_size that is known at compile time, then read a runtime size value (size <= max_size) and process the first size elements of that array.
The question is: why doesn't the compiler eliminate the stack reads and writes when arr and size are placed into the same struct/class, as opposed to the case where arr and size are independent local variables?
Here's my code:
#include <cstddef>
constexpr std::size_t max_size = 64;
extern void process_value(double& ref_value);
void test_distinct_array_and_size(std::size_t size)
{
double arr[max_size];
std::size_t arr_size = size;
for (std::size_t i = 0; i < arr_size; ++i)
process_value(arr[i]);
}
void test_array_and_size_in_local_struct(std::size_t size)
{
struct
{
double arr[max_size];
std::size_t size;
} array_wrapper;
array_wrapper.size = size;
for (std::size_t i = 0; i < array_wrapper.size; ++i)
process_value(array_wrapper.arr[i]);
}
Assembly output for test_distinct_array_and_size from Clang with -O3:
test_distinct_array_and_size(unsigned long): # @test_distinct_array_and_size(unsigned long)
push r14
push rbx
sub rsp, 520
mov r14, rdi
test r14, r14
je .LBB0_3
mov rbx, rsp
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov rdi, rbx
call process_value(double&)
add rbx, 8
dec r14
jne .LBB0_2
.LBB0_3:
add rsp, 520
pop rbx
pop r14
ret
Assembly output for test_array_and_size_in_local_struct:
test_array_and_size_in_local_struct(unsigned long): # @test_array_and_size_in_local_struct(unsigned long)
push r14
push rbx
sub rsp, 520
mov qword ptr [rsp + 512], rdi
test rdi, rdi
je .LBB1_3
mov r14, rsp
xor ebx, ebx
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov rdi, r14
call process_value(double&)
inc rbx
add r14, 8
cmp rbx, qword ptr [rsp + 512]
jb .LBB1_2
.LBB1_3:
add rsp, 520
pop rbx
pop r14
ret
Latest GCC and MSVC compilers do basically the same thing with the stack reads and writes.
As we can see, reads and writes of the array_wrapper.size variable on the stack are not optimized away in the latter case: there is a write of the size value to [rsp + 512] before the loop begins, and a read from that location after each iteration.
So the compiler apparently expects that we might modify array_wrapper.size from within the process_value(array_wrapper.arr[i]) call (by taking the address of the current array element and applying some offset to it?). But if we actually tried to do that from inside the call, wouldn't it be undefined behavior?
When we rewrite the loop as
for (std::size_t i = 0, sz = array_wrapper.size; i < sz; ++i)
    process_value(array_wrapper.arr[i]);
the unnecessary reads at the end of each iteration are gone. But the initial write to [rsp + 512] remains, meaning the compiler still expects the process_value calls to be able to reach the array_wrapper.size variable at that location (through similar offset-based tricks).
Why?
Is this just a shortcoming of current compiler implementations (one that will hopefully be fixed soon)? Or does the C++ standard indeed mandate behavior that leads to less efficient code whenever we put an array and its size into the same class?
P.S.
I realize the code example above might seem a bit contrived, but consider this: I'd like to use a lightweight boost::container::static_vector-like class template for safer and more convenient "C++-style" manipulation of pseudo-dynamic arrays of POD elements. So my PODVector contains an array and a size_t in the same class:
#include <cassert>      // assert, used in the constructor, resize and operator[]
#include <cstddef>      // std::size_t
#include <type_traits>  // std::is_pod

template<typename T, std::size_t MaxSize>
class PODVector
{
static_assert(std::is_pod<T>::value, "T must be a POD type");
private:
T _data[MaxSize];
std::size_t _size = 0;
public:
using iterator = T *;
public:
static constexpr std::size_t capacity() noexcept
{
return MaxSize;
}
constexpr PODVector() noexcept = default;
explicit constexpr PODVector(std::size_t initial_size)
: _size(initial_size)
{
assert(initial_size <= capacity());
}
constexpr std::size_t size() const noexcept
{
return _size;
}
constexpr void resize(std::size_t new_size)
{
assert(new_size <= capacity());
_size = new_size;
}
constexpr iterator begin() noexcept
{
return _data;
}
constexpr iterator end() noexcept
{
return _data + _size;
}
constexpr T & operator[](std::size_t position)
{
assert(position < _size);
return _data[position];
}
};
Usage:
void test_pod_vector(std::size_t size)
{
PODVector<double, max_size> arr(size);
for (double& val : arr)
process_value(val);
}
If the issue described above is indeed mandated by the C++ standard (and is not the fault of compiler writers), such a PODVector can never be as efficient as the raw use of an array plus an "unrelated" size variable. That would be quite bad for C++ as a language that aims for zero-overhead abstractions.
This is because void process_value(double& ref_value); accepts its argument by reference. The compiler/optimizer must assume aliasing, i.e. that the process_value function can change any memory reachable through the reference ref_value, and hence the size member that follows the array.
The compiler assumes that, because the array and size are members of one and the same object array_wrapper, the function process_value could potentially convert the reference to the first element (on the first invocation) into a reference to the whole object (and store it elsewhere), or access the object as unsigned char and read or replace its entire representation. So after the function returns, the state of the object must be reloaded from memory.
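The kind of access being guarded against can be sketched like this (my own illustration, not from the answer; Wrapper is a hypothetical stand-in for the question's anonymous struct, and whether such code is strictly well-defined is exactly what the question debates — but it does work on mainstream compilers, so the optimizer must account for it):

```cpp
#include <cassert>
#include <cstddef>  // offsetof, std::size_t

// Hypothetical enclosing object, mirroring the question's anonymous struct.
struct Wrapper {
    double arr[64];
    std::size_t size;
};

// A callee that receives a reference to the first element and, at least
// in practice, reaches back to the enclosing object and clobbers its
// size member. This is the store the optimizer must assume might happen.
void clobber_size(double& ref_value)
{
    char* base = reinterpret_cast<char*>(&ref_value) - offsetof(Wrapper, arr);
    reinterpret_cast<Wrapper*>(base)->size = 0;
}
```

With a callee like this, caching array_wrapper.size in a register across the call would change observable behavior, which is why the compiler reloads it from the stack.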
When size is a stand-alone object on the stack, the compiler/optimizer assumes that nothing else could possibly have a reference or pointer to it, and caches it in a register.
In his talk "Optimizing the Emergent Structures of C++", Chandler Carruth explains why optimizers have difficulty with calls to functions accepting reference/pointer arguments. Use reference/pointer function arguments only when absolutely necessary.
If you would like to change the value, the more performant option is:
double process_value(double value);
And then:
array_wrapper.arr[i] = process_value(array_wrapper.arr[i]);
This change results in optimal assembly:
.L23:
movsd xmm0, QWORD PTR [rbx]
add rbx, 8
call process_value2(double)
movsd QWORD PTR [rbx-8], xmm0
cmp rbx, rbp
jne .L23
Or:
for(double& val : arr)
val = process_value(val);
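A minimal self-contained sketch of that by-value pattern (process_value here is a local stand-in for the external function — it simply doubles its input — and process_all is a hypothetical driver loop):

```cpp
#include <cstddef>

// Stand-in for the external process_value: takes and returns by value,
// so no reference to the caller's memory escapes into the callee.
static double process_value(double value)
{
    return value * 2.0;
}

// With no escaping references, the compiler can keep the loop bound in
// a register even when it lives in the same struct as the array.
void process_all(double* arr, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        arr[i] = process_value(arr[i]);
}
```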