I am trying to determine what overhead std::atomic
introduces to unconditional memory writes on my system (eight-core x64). Here is my benchmark program:
#include <atomic>
#include <iostream>
#include <omp.h>

int main() {
    std::atomic_int foo(0);     // VERSION 1
    //volatile int foo = 0;     // VERSION 2
    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo.store(i, std::memory_order_relaxed);    // VERSION 1
        //foo = i;                                  // VERSION 2
    }
    std::cout << foo << std::endl;
}
The program as-is benchmarks std::atomic_int; commenting out the lines labeled VERSION 1 and uncommenting the lines labeled VERSION 2 tests volatile int in its place. Even when unsynchronized, the output of both programs should be 10000000 - 1.
This is my command-line:
g++ -O2 -std=c++11 -fopenmp test.c++
The version that uses atomic_int
takes between two and three seconds on my system, while the one that uses volatile int
almost always completes in less than a tenth of a second.
The salient difference in the assembly is this (output from diff --side-by-side):

volatile int                            atomic_int
.L2:                                    .L2:
    mov DWORD PTR [rdi], eax          |     mov rdx, QWORD PTR [rdi]
                                      >     mov DWORD PTR [rdx], eax
    add eax, 1                              add eax, 1
    cmp eax, 10000000                       cmp eax, 10000000
    jne .L2                                 jne .L2
    rep ret                                 rep ret
rdi is the first argument to the function that gets run in parallel (it is not modified anywhere in the function), and it is apparently a pointer to (a pointer to, in the second column) the integer foo. I do not believe that this extra mov is integral to the atomicity guarantee of atomic_int.

The extra mov is indeed the source of the slowdown for atomic_int; moving it above .L2 allows both versions to achieve the same performance, and both output the correct number. When foo is made a global variable, atomic_int attains the same improved performance as volatile int.
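(For reference, the global-variable variant I mean is essentially the following; it is the same benchmark with foo moved from main's stack to namespace scope, compiled with the same command line.)

#include <atomic>
#include <iostream>
#include <omp.h>

std::atomic_int foo(0);     // now a global rather than a local of main

int main() {
    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo.store(i, std::memory_order_relaxed);
    }
    std::cout << foo << std::endl;
}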
My questions are these: Why is the compiler passing a pointer to a pointer in the case of a stack-allocated atomic_int but only a pointer in the case of global atomic_int or stack-allocated volatile int; why is it loading that pointer on every iteration of the loop since it is (I believe) loop-invariant code; and what changes to the C++ source can I make to have atomic_int match volatile int in this benchmark?
Running this program:
#include <atomic>
#include <iostream>
#include <thread>

//using T = volatile int;     // VERSION 1
using T = std::atomic_int;    // VERSION 2

void foo(T* ptr) {
    for (unsigned int i = 0; i < 10000000; ++i) {
        //*ptr = i;                                  // VERSION 1
        ptr->store(i, std::memory_order_relaxed);    // VERSION 2
    }
}

int main() {
    T i { 0 };
    std::thread threads[4];
    for (auto& x : threads)
        x = std::move(std::thread { foo, &i });
    for (auto& x : threads)
        x.join();
    std::cout << i << std::endl;
}
yields the same, improved performance for both versions 1 and 2, which leads me to believe that it is a peculiarity of OpenMP that forces the worse performance for atomic_int. Is OpenMP correct, or is it generating suboptimal code?
Things get much easier to understand if you look at the intermediate representation of the program (-fdump-tree-all is your friend there) rather than at the assembly output.
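For example, reusing the command line from the question with the dump flag added writes a series of intermediate-representation dump files, one per compiler pass (the exact file names vary between GCC versions):

g++ -O2 -std=c++11 -fopenmp -fdump-tree-all test.c++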
Why is the compiler passing a pointer to a pointer in the case of a stack-allocated atomic_int but only a pointer in the case of global atomic_int or stack-allocated volatile int;
This is an implementation detail. GCC transforms parallel regions by outlining them into separate functions that then receive as their sole argument a structure containing all the shared variables, as well as the initial values of firstprivate variables and placeholders for the final values of lastprivate variables. When foo is simply an integer and no implicit or explicit flush regions are present, the compiler passes a copy of it in the argument to the outlined function:
struct omp_data_s
{
    int foo;
};

void main._omp_fn.0(struct omp_data_s *omp_data_i)
{
    ...
    omp_data_i->foo = i;
    ...
}

int main() {
    volatile int foo = 0;
    struct omp_data_s omp_data_o;
    omp_data_o.foo = foo;
    GOMP_parallel(main._omp_fn.0, &omp_data_o, 0, 0);
    foo = omp_data_o.foo;
    ...
}
omp_data_i is passed via rdi (per the x86-64 ABI) and omp_data_i->foo = i; compiles to simply movl %eax, (%rdi) (given that i is stored in eax), since foo is the first (and only) element of the structure.
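A tiny standalone sketch of why no offset arithmetic is needed (the struct name just mirrors the pseudo-code above): as the first member of a standard-layout structure, foo sits at offset 0, so a store through the structure pointer targets the structure's own address.

#include <cstddef>   // offsetof

// Mirror of the outlined function's argument structure from the dump above.
struct omp_data_s { int foo; };

// The first member of a standard-layout struct lives at offset 0, which is
// why omp_data_i->foo = i becomes a single store through rdi with no
// displacement.
static_assert(offsetof(omp_data_s, foo) == 0,
              "foo starts at the beginning of the structure");

int main() {}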
When foo is std::atomic_int, it is no longer a plain integer but a structure wrapping the integer value. In that case, GCC passes a pointer in the parameter structure rather than the value itself:
struct omp_data_s
{
    struct atomic_int *foo;
};

void main._omp_fn.0(struct omp_data_s *omp_data_i)
{
    ...
    __atomic_store_4(&omp_data_i->foo->_M_i, i, 0);
    ...
}

int main() {
    struct atomic_int foo;
    struct omp_data_s omp_data_o;
    omp_data_o.foo = &foo;
    GOMP_parallel(main._omp_fn.0, &omp_data_o, 0, 0);
    ...
}
In that case, the additional assembly instruction (movq (%rdi), %rdx) is the dereference of the first pointer (to the OpenMP data structure), and the second instruction is the atomic write itself (which on x86-64 is simply a store).
When foo is global, it is not passed as part of the argument structure to the outlined code. In that particular case, the outlined code receives a NULL pointer, since the argument structure is empty.
void main._omp_fn.0(void *omp_data_i)
{
    ...
    __atomic_store_4(&foo._M_i, i, 0);
    ...
}
why is it loading that pointer on every iteration of the loop since it is (I believe) loop-invariant code;
The pointer argument itself (the value of rdi) is loop-invariant, but the value it points to might change outside of the function, since foo is a shared variable. Effectively, GCC treats all variables with the OpenMP data-sharing class shared as volatile. Again, this is an implementation detail, as the OpenMP standard allows for a relaxed-consistency memory model in which writes to shared variables do not become visible to other threads unless the flush construct is used in both the writer and the reader. GCC actually takes advantage of that relaxed consistency to optimise the code by passing copies of some shared variables instead of pointers to the original variables (thus saving one dereference). Had there been a flush region in your code, either explicit

foo = i;
#pragma omp flush(foo)

or implicit

#pragma omp atomic write
foo = i;

GCC would have passed a pointer to foo instead, as seen in the other answer. The reason is that flush constructs synchronise each thread's memory view with the global view, in which the shared foo refers to the original variable (hence the pointer to it rather than a copy).
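For completeness, here is what the explicit-flush case looks like in context; this is just my sketch of the benchmark from the question reworked with a plain shared int and the flush added, useful for reproducing the IR, not something I would recommend as a fix:

#include <iostream>
#include <omp.h>

int main() {
    int foo = 0;   // plain shared int; the explicit flush is what matters here
    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo = i;
        #pragma omp flush(foo)
    }
    std::cout << foo << std::endl;
}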
and what changes to the C++ source can I make to have atomic_int match volatile int in this benchmark?

Besides switching to a different compiler, I can't think of any portable change. GCC passes shared variables of structure type (std::atomic is a structure) as pointers, and that's it.
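That said, if you are willing to rely on GCC-specific behaviour rather than anything portable or guaranteed, one thing you could experiment with (untested, offered only as a sketch) is taking the address into a private local pointer once, so that the loop body no longer goes through the shared structure member:

#include <atomic>
#include <iostream>
#include <omp.h>

int main() {
    std::atomic_int foo(0);
    #pragma omp parallel
    {
        // Private per-thread copy of the address, taken once; the hope is
        // that the compiler keeps it in a register instead of re-reading the
        // shared pointer slot on every iteration. No guarantee that it does.
        std::atomic_int* const p = &foo;
        for (unsigned int i = 0; i < 10000000; ++i)
            p->store(i, std::memory_order_relaxed);
    }
    std::cout << foo << std::endl;
}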
Is OpenMP correct, or is it generating suboptimal code?
OpenMP is correct. It is a multiplatform specification that defines specific (and intentionally broad) memory and operational semantics, and GCC follows them. It might not always give you the best performance for a particular case on a particular platform, but then the code is portable and it is relatively easy to go from serial to parallel with the addition of a single pragma.
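As a generic illustration of that last point (the names here are just placeholders, unrelated to the benchmark), a serial loop becomes parallel with a single pragma:

#include <vector>

int main() {
    std::vector<double> data(1000000, 1.0);
    // The only change from the serial version is the pragma on the next line.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        data[i] *= 2.0;
}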
Of course, the GCC people could certainly learn to optimise better; the Intel C++ Compiler already does:
                                # LOE rdx ecx
..B1.14:                        # Preds ..B1.15 ..B1.13
        movl      %ecx, %eax                                  #13.13
        movl      %eax, (%rdx)                                #13.13
                                # LOE rdx ecx
..B1.15:                        # Preds ..B1.14
        incl      %ecx                                        #12.46
        cmpl      $10000000, %ecx                             #12.34
        jb        ..B1.14       # Prob 99%                    #12.34