Why don't C++ compilers optimize away reads and writes to struct data members as opposed to distinct local variables?

I'm trying to create a local array of POD values (e.g. double) with a fixed max_size known at compile time, then read a runtime size value (size <= max_size) and process the first size elements of that array.

The question is: why doesn't the compiler eliminate the stack reads and writes when arr and size are placed in the same struct/class, as opposed to the case where arr and size are independent local variables?

Here's my code:

#include <cstddef>
constexpr std::size_t max_size = 64;

extern void process_value(double& ref_value);

void test_distinct_array_and_size(std::size_t size)
{
    double arr[max_size];
    std::size_t arr_size = size;

    for (std::size_t i = 0; i < arr_size; ++i)
        process_value(arr[i]);
}

void test_array_and_size_in_local_struct(std::size_t size)
{
    struct
    {
        double arr[max_size];
        std::size_t size;
    } array_wrapper;
    array_wrapper.size = size;

    for (std::size_t i = 0; i < array_wrapper.size; ++i)
        process_value(array_wrapper.arr[i]);
}

Assembly output for test_distinct_array_and_size from Clang with -O3:

test_distinct_array_and_size(unsigned long): # @test_distinct_array_and_size(unsigned long)
  push r14
  push rbx
  sub rsp, 520
  mov r14, rdi
  test r14, r14
  je .LBB0_3
  mov rbx, rsp
.LBB0_2: # =>This Inner Loop Header: Depth=1
  mov rdi, rbx
  call process_value(double&)
  add rbx, 8
  dec r14
  jne .LBB0_2
.LBB0_3:
  add rsp, 520
  pop rbx
  pop r14
  ret

Assembly output for test_array_and_size_in_local_struct:

test_array_and_size_in_local_struct(unsigned long): # @test_array_and_size_in_local_struct(unsigned long)
  push r14
  push rbx
  sub rsp, 520
  mov qword ptr [rsp + 512], rdi
  test rdi, rdi
  je .LBB1_3
  mov r14, rsp
  xor ebx, ebx
.LBB1_2: # =>This Inner Loop Header: Depth=1
  mov rdi, r14
  call process_value(double&)
  inc rbx
  add r14, 8
  cmp rbx, qword ptr [rsp + 512]
  jb .LBB1_2
.LBB1_3:
  add rsp, 520
  pop rbx
  pop r14
  ret

The latest GCC and MSVC compilers emit essentially the same stack reads and writes.

As we can see, the reads and writes of array_wrapper.size on the stack are not optimized away in the latter case: the size value is stored to [rsp + 512] before the loop begins, and read back from that location after each iteration.

So the compiler seems to expect that we might modify array_wrapper.size from within the process_value(array_wrapper.arr[i]) call (by taking the address of the current array element and applying some weird offset to it?).

But if we tried to do that from inside the call, wouldn't it be undefined behavior?
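For concreteness, here is a sketch (entirely hypothetical, since the real process_value is extern) of the kind of access the optimizer apparently cannot rule out: given a reference to the first array element, the callee could in principle reach the size member through byte arithmetic within the enclosing object. Whether doing so is well-defined is exactly what is being asked:

```cpp
#include <cstddef>
#include <cstring>

// Mirrors the anonymous local struct from the question.
struct Wrapper
{
    double arr[64];
    std::size_t size;
};

// The callee only sees a double&, but if that double is wrapper.arr[0],
// the size member sits a fixed number of bytes past it.
void sneaky_process_value(double& ref_value)
{
    unsigned char* base = reinterpret_cast<unsigned char*>(&ref_value);
    std::size_t zero = 0;
    // Overwrite wrapper.size through its byte offset (this only "works"
    // when ref_value actually refers to the first element of the array).
    std::memcpy(base + offsetof(Wrapper, size), &zero, sizeof zero);
}
```

In practice this compiles and does clobber size on mainstream compilers, which is consistent with the conservative code generation shown above.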

When we rewrite the loop in the following manner:

for (std::size_t i = 0, sz = array_wrapper.size; i < sz; ++i)
    process_value(array_wrapper.arr[i]);

the unnecessary reads at the end of each iteration are gone. But the initial write to [rsp + 512] remains, which means the compiler still expects the process_value calls to be able to access array_wrapper.size at that location (again via some offset-based trickery).

Why?

Is this just a shortcoming of current compiler implementations (one that will hopefully be fixed soon)? Or does the C++ standard really require behavior that leads to less efficient code whenever we put an array and its size in the same class?

P.S.

I realize that my code example above might seem a bit contrived. But consider this: I'd like to use a lightweight boost::container::static_vector-like class template in my code for safer, more convenient "C++-style" manipulation of pseudo-dynamic arrays of POD elements. So my PODVector will contain an array and a size_t in the same class:

#include <cassert>
#include <cstddef>
#include <type_traits>

template<typename T, std::size_t MaxSize>
class PODVector
{
    static_assert(std::is_pod<T>::value, "T must be a POD type");

private:
    T _data[MaxSize];
    std::size_t _size = 0;

public:
    using iterator = T *;

public:
    static constexpr std::size_t capacity() noexcept
    {
        return MaxSize;
    }

    constexpr PODVector() noexcept = default;

    explicit constexpr PODVector(std::size_t initial_size)
        : _size(initial_size)
    {
        assert(initial_size <= capacity());
    }

    constexpr std::size_t size() const noexcept
    {
        return _size;
    }

    constexpr void resize(std::size_t new_size)
    {
        assert(new_size <= capacity());
        _size = new_size;
    }

    constexpr iterator begin() noexcept
    {
        return _data;
    }

    constexpr iterator end() noexcept
    {
        return _data + _size;
    }

    constexpr T & operator[](std::size_t position)
    {
        assert(position < _size);
        return _data[position];
    }
};

Usage:

void test_pod_vector(std::size_t size)
{
    PODVector<double, max_size> arr(size);

    for (double& val : arr)
        process_value(val);
}

If the issue described above is indeed forced by the C++ standard (and is not the fault of compiler writers), such a PODVector will never be as efficient as using a raw array and an "unrelated" size variable. That would be quite bad for C++ as a language that aims for zero-overhead abstractions.
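One mitigating detail: the range-based for in test_pod_vector is specified to evaluate begin() and end() exactly once, before the first iteration, so _size is loaded once up front rather than reloaded after every call (the initial store of _size to the stack still remains). A minimal self-contained sketch, using a hypothetical stand-in for PODVector's layout and a stub process_value that counts its calls:

```cpp
#include <cstddef>

// Minimal stand-in mirroring PODVector's layout (names hypothetical).
template<std::size_t MaxSize>
struct SmallVec
{
    double data[MaxSize] = {};
    std::size_t size = 0;
    double* begin() { return data; }
    double* end() { return data + size; }  // size is read when this is called
};

// Stub for the extern process_value; counts calls so the loop is observable.
std::size_t process_calls = 0;
void process_value_stub(double& val)
{
    val = 1.0;
    ++process_calls;
}

void test_small_vec(std::size_t size)
{
    SmallVec<64> arr;
    arr.size = size;
    // Desugars roughly to: auto first = arr.begin(), last = arr.end();
    // for (; first != last; ++first) { ... }  -- end() is not re-evaluated,
    // so the loop bound cannot be affected by anything the callee does.
    for (double& val : arr)
        process_value_stub(val);
}
```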

Taras asked Jan 17 '18 14:01


1 Answer

This is because void process_value(double& ref_value); accepts its argument by reference. The compiler/optimizer must assume aliasing, i.e. that process_value can change memory reachable through ref_value, including the size member that follows the array.

Because the array and the size are members of one and the same object array_wrapper, the compiler assumes that process_value could potentially cast the reference to the first element (on the first invocation) to a reference to the whole object (and store it elsewhere), or cast the object to unsigned char and read or replace its entire representation. So after the function returns, the state of the object must be reloaded from memory.

When size is a stand-alone object on the stack, the compiler/optimizer can assume that nothing else has a reference or pointer to it, and caches it in a register.

In his talk Chandler Carruth: Optimizing the Emergent Structures of C++, he explains why optimizers have difficulty with functions that accept reference/pointer arguments. Use reference/pointer function arguments only when absolutely necessary.

If you need to change the value, the more optimizer-friendly option is to pass and return by value:

double process_value(double value);

And then:

array_wrapper.arr[i] = process_value(array_wrapper.arr[i]);

This change results in optimal assembly:

.L23:
movsd xmm0, QWORD PTR [rbx]
add rbx, 8
call process_value(double)
movsd QWORD PTR [rbx-8], xmm0
cmp rbx, rbp
jne .L23

Or:

for(double& val : arr)
    val = process_value(val);
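Putting the by-value variant together as a self-contained sketch (process_value_by_value here is a hypothetical stand-in that doubles its input, since the original function is extern):

```cpp
#include <cstddef>

constexpr std::size_t max_size = 64;

// Hypothetical stand-in for the extern function from the question.
double process_value_by_value(double value)
{
    return value * 2.0;
}

// With the element passed and returned by value, the callee receives no
// pointer or reference into the caller's frame, so nothing can alias the
// loop bound and the compiler is free to keep it in a register.
void test_by_value(double (&arr)[max_size], std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
        arr[i] = process_value_by_value(arr[i]);
}
```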
Maxim Egorushkin answered Oct 05 '22 00:10