I am trying to determine what overhead std::atomic
introduces to unconditional memory writes on my system (eight-core x64). Here is my benchmark program:
#include <atomic>
#include <iostream>
#include <omp.h>

int main() {
    std::atomic_int foo(0);     // VERSION 1
    //volatile int foo = 0;     // VERSION 2
    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo.store(i, std::memory_order_relaxed);    // VERSION 1
        //foo = i;                                  // VERSION 2
    }
    std::cout << foo << std::endl;
}
The program as-is benchmarks std::atomic_int; commenting out the lines labeled VERSION 1 and uncommenting the lines labeled VERSION 2 tests volatile int in its place. Even when unsynchronized, the output of both programs should be 10000000 - 1.
This is my command-line:
g++ -O2 -std=c++11 -fopenmp test.c++
The version that uses atomic_int
takes between two and three seconds on my system, while the one that uses volatile int
almost always completes in less than a tenth of a second.
The salient difference in the assembly is this (output from diff --side-by-side):

volatile int                            atomic_int
.L2:                                    .L2:
    mov DWORD PTR [rdi], eax          |     mov rdx, QWORD PTR [rdi]
                                      >     mov DWORD PTR [rdx], eax
    add eax, 1                              add eax, 1
    cmp eax, 10000000                       cmp eax, 10000000
    jne .L2                                 jne .L2
    rep ret                                 rep ret
rdi is the first argument to the function that gets run in parallel (it is not modified anywhere in the function), and it is apparently a pointer to (a pointer to, in the second column) the integer foo. I do not believe that this extra mov is integral to the atomicity guarantee of atomic_int.

The extra mov is indeed the source of the slowdown for atomic_int; moving it above .L2 allows both versions to achieve the same performance, and both output the correct number. When foo is made a global variable, atomic_int attains the same improved performance as volatile int.
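(For reference, the global-variable variant I mean is essentially the following; it is the same benchmark with foo moved from main's stack to namespace scope, compiled with the same command line.)

#include <atomic>
#include <iostream>
#include <omp.h>

std::atomic_int foo(0);     // now a global rather than a local of main

int main() {
    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo.store(i, std::memory_order_relaxed);
    }
    std::cout << foo << std::endl;
}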
My questions are these: Why is the compiler passing a pointer to a pointer in the case of a stack-allocated atomic_int but only a pointer in the case of global atomic_int or stack-allocated volatile int; why is it loading that pointer on every iteration of the loop since it is (I believe) loop-invariant code; and what changes to the C++ source can I make to have atomic_int match volatile int in this benchmark?
Running this program:
#include <atomic>
#include <iostream>
#include <thread>

//using T = volatile int;     // VERSION 1
using T = std::atomic_int;    // VERSION 2

void foo(T* ptr) {
    for (unsigned int i = 0; i < 10000000; ++i) {
        //*ptr = i;                                  // VERSION 1
        ptr->store(i, std::memory_order_relaxed);    // VERSION 2
    }
}

int main() {
    T i { 0 };
    std::thread threads[4];
    for (auto& x : threads)
        x = std::move(std::thread { foo, &i });
    for (auto& x : threads)
        x.join();
    std::cout << i << std::endl;
}
yields the same, improved performance for both versions 1 and 2, which leads me to believe that it is a peculiarity of OpenMP that forces the worse performance for atomic_int. Is OpenMP correct, or is it generating suboptimal code?
Things get much easier to understand if you look at the intermediate representation of the program (-fdump-tree-all is your friend there) rather than at the assembly output.
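For example, reusing the command line from the question with the dump flag added writes a series of intermediate-representation dump files, one per compiler pass (the exact file names vary between GCC versions):

g++ -O2 -std=c++11 -fopenmp -fdump-tree-all test.c++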
Why is the compiler passing a pointer to a pointer in the case of a stack-allocated atomic_int but only a pointer in the case of global atomic_int or stack-allocated volatile int;
This is an implementation detail. GCC transforms parallel regions by outlining them into separate functions that then receive as their sole argument a structure containing all the shared variables, as well as the initial values of firstprivate variables and placeholders for the final values of lastprivate variables. When foo is simply an integer and no implicit or explicit flush regions are present, the compiler passes a copy of it in the argument to the outlined function:
struct omp_data_s
{
    int foo;
};

void main._omp_fn.0(struct omp_data_s *omp_data_i)
{
    ...
    omp_data_i->foo = i;
    ...
}

int main() {
    volatile int foo = 0;
    struct omp_data_s omp_data_o;
    omp_data_o.foo = foo;
    GOMP_parallel(main._omp_fn.0, &omp_data_o, 0, 0);
    foo = omp_data_o.foo;
    ...
}
omp_data_i is passed via rdi (per the x86-64 ABI) and omp_data_i->foo = i; compiles to simply movl %eax, (%rdi) (given that i is stored in eax), since foo is the first (and only) element of the structure.
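A tiny standalone sketch of why no offset arithmetic is needed (the struct name just mirrors the pseudo-code above): as the first member of a standard-layout structure, foo sits at offset 0, so a store through the structure pointer targets the structure's own address.

#include <cstddef>   // offsetof

// Mirror of the outlined function's argument structure from the dump above.
struct omp_data_s { int foo; };

// The first member of a standard-layout struct lives at offset 0, which is
// why omp_data_i->foo = i becomes a single store through rdi with no
// displacement.
static_assert(offsetof(omp_data_s, foo) == 0,
              "foo starts at the beginning of the structure");

int main() {}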
When foo is std::atomic_int, it is no longer a plain integer but a structure wrapping the integer value. In that case, GCC passes a pointer in the parameter structure rather than the value itself:
struct omp_data_s
{
    struct atomic_int *foo;
};

void main._omp_fn.0(struct omp_data_s *omp_data_i)
{
    ...
    __atomic_store_4(&omp_data_i->foo->_M_i, i, 0);
    ...
}

int main() {
    struct atomic_int foo;
    struct omp_data_s omp_data_o;
    omp_data_o.foo = &foo;
    GOMP_parallel(main._omp_fn.0, &omp_data_o, 0, 0);
    ...
}
In that case, the additional assembly instruction (movq (%rdi), %rdx) is the dereference of the first pointer (to the OpenMP data structure), and the second instruction is the atomic write itself (which on x86-64 is simply a store).
When foo is global, it is not passed as part of the argument structure to the outlined code. In that particular case, the outlined code receives a NULL pointer, since the argument structure is empty.
void main._omp_fn.0(void *omp_data_i)
{
    ...
    __atomic_store_4(&foo._M_i, i, 0);
    ...
}
why is it loading that pointer on every iteration of the loop since it is (I believe) loop-invariant code;
The pointer argument itself (the value of rdi) is loop-invariant, but the value it points to might change outside of the function, since foo is a shared variable. Effectively, GCC treats all variables with the OpenMP data-sharing class shared as volatile. Again, this is an implementation detail, as the OpenMP standard allows for a relaxed-consistency memory model in which writes to shared variables do not become visible to other threads unless the flush construct is used in both the writer and the reader. GCC actually takes advantage of that relaxed consistency to optimise the code by passing copies of some shared variables instead of pointers to the original variables (thus saving one dereference). Had there been a flush region in your code, either explicit

foo = i;
#pragma omp flush(foo)

or implicit

#pragma omp atomic write
foo = i;

GCC would have passed a pointer to foo instead, as seen in the other answer. The reason is that flush constructs synchronise each thread's memory view with the global view, in which the shared foo refers to the original variable (hence the pointer to it rather than a copy).
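For completeness, here is what the explicit-flush case looks like in context; this is just my sketch of the benchmark from the question reworked with a plain shared int and the flush added, useful for reproducing the IR, not something I would recommend as a fix:

#include <iostream>
#include <omp.h>

int main() {
    int foo = 0;   // plain shared int; the explicit flush is what matters here
    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo = i;
        #pragma omp flush(foo)
    }
    std::cout << foo << std::endl;
}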
and what changes to the C++ source can I make to have atomic_int match volatile int in this benchmark?

Besides switching to a different compiler, I can't think of any portable change. GCC passes shared variables of structure type (std::atomic is a structure) as pointers, and that's it.
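That said, if you are willing to rely on GCC-specific behaviour rather than anything portable or guaranteed, one thing you could experiment with (untested, offered only as a sketch) is taking the address into a private local pointer once, so that the loop body no longer goes through the shared structure member:

#include <atomic>
#include <iostream>
#include <omp.h>

int main() {
    std::atomic_int foo(0);
    #pragma omp parallel
    {
        // Private per-thread copy of the address, taken once; the hope is
        // that the compiler keeps it in a register instead of re-reading the
        // shared pointer slot on every iteration. No guarantee that it does.
        std::atomic_int* const p = &foo;
        for (unsigned int i = 0; i < 10000000; ++i)
            p->store(i, std::memory_order_relaxed);
    }
    std::cout << foo << std::endl;
}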
Is OpenMP correct, or is it generating suboptimal code?
OpenMP is correct. It is a multiplatform specification that defines specific (and intentionally broad) memory and operational semantics, and GCC follows them. It might not always give you the best performance for a particular case on a particular platform, but then the code is portable and it is relatively easy to go from serial to parallel with the addition of a single pragma.
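As a generic illustration of that last point (the names here are just placeholders, unrelated to the benchmark), a serial loop becomes parallel with a single pragma:

#include <vector>

int main() {
    std::vector<double> data(1000000, 1.0);
    // The only change from the serial version is the pragma on the next line.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        data[i] *= 2.0;
}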
Of course, the GCC people could certainly learn to optimise better; the Intel C++ Compiler already does:
                                # LOE rdx ecx
..B1.14:                        # Preds ..B1.15 ..B1.13
        movl      %ecx, %eax                                  #13.13
        movl      %eax, (%rdx)                                #13.13
                                # LOE rdx ecx
..B1.15:                        # Preds ..B1.14
        incl      %ecx                                        #12.46
        cmpl      $10000000, %ecx                             #12.34
        jb        ..B1.14       # Prob 99%                    #12.34