I have a simple question: can C++11 thread_local be used with other parallel models? For example, can I use it within a function while using OpenMP or Intel TBB to parallelize the tasks? Most such parallel programming models hide hardware threads behind a higher-level API. My instinct is that they all have to map their task schedulers onto hardware threads. Can I expect that C++11 thread_local will have the expected effect?
A simple example is,
void func ()
{
    static thread_local int some_var = init_val;
    #pragma omp parallel for [... clauses ...]
    for (int i = 0; i < N; ++i) {
        // access some_var somewhere within the loop
    }
}
Can I expect that each OpenMP thread will access its own copy of some_var?
I know that most parallel programming models have their own constructs for thread-local storage. However, having the ability to use C++11 thread_local (or a compiler-specific keyword) is nice. For example, consider the following situation:
// may actually be implemented as a class with operator()
void func ()
{
    static thread_local int some_var;
    // a quite complex function body
}
void func_omp (int N)
{
    #pragma omp parallel for [... clauses ...]
    for (int i = 0; i < N; ++i)
        func();
}
void func_tbb (int N)
{
    tbb::parallel_for(tbb::blocked_range<int>(0, N), func);
}
void func_select (int N)
{
    // At run time or at compile time, based on which programming model
    // is available, select func_omp or func_tbb
}
The basic idea here is that func may be quite complex, and I want to support multiple parallel programming models. If I use model-specific thread-local constructs, then I have to implement different versions of func, or at least parts of it. However, if I can freely use C++11 thread_local, then in addition to func I only need to implement a few very simple wrapper functions. For a larger project, things can be simplified further by using templates to write more generic versions of func_omp and func_tbb. However, I am not quite sure it is safe to do so.
On the one hand, the OpenMP specification intentionally omits any provisions concerning interoperability with other programming paradigms, so any mixing of C++11 threading with OpenMP is non-standard and vendor-specific. On the other hand, compilers (at least GCC) tend to use the same underlying TLS mechanism to implement OpenMP's #pragma omp threadprivate, C++11's thread_local, and the various compiler-specific storage classes like __thread.
For example, GCC implements its OpenMP runtime (libgomp) entirely on top of the POSIX threads API and implements OpenMP threadprivate by placing the variables in ELF TLS storage. This interoperates with GNU's C++11 implementation, which also uses POSIX threads and places thread_local variables in ELF TLS storage. Ultimately this also interoperates with code that uses the __thread keyword to specify thread-local storage class, and with explicit POSIX threads API calls. For example, the following code:
int foo;
#pragma omp threadprivate(foo)

__thread int bar;
thread_local int baz;

int func(void)
{
    return foo + bar + baz;
}
compiles into:
.globl foo
.section .tbss,"awT",@nobits
.align 4
.type foo, @object
.size foo, 4
foo:
.zero 4
.globl bar
.align 4
.type bar, @object
.size bar, 4
bar:
.zero 4
.globl baz
.align 4
.type baz, @object
.size baz, 4
baz:
.zero 4
movl %fs:foo@tpoff, %edx
movl %fs:bar@tpoff, %eax
addl %eax, %edx
movl %fs:baz@tpoff, %eax
addl %edx, %eax
Here the .tbss ELF section is the thread-local BSS (uninitialised data). All three variables are created and accessed in the same way.
Interoperability with other compilers is of less concern right now: Intel's compiler does not implement thread_local, while Clang still lacks OpenMP support.