Why is thread local storage not implemented with page table mappings?

I was hoping to use the C++11 thread_local keyword for a per-thread boolean flag that is going to be accessed very frequently.

However, most compilers seem to implement thread-local storage with a table that maps integer IDs (slots) to the variable's address on the current thread. This lookup would happen inside a performance-critical code path, so I have some concerns about its cost.

The way I would have expected thread local storage to be implemented is by allocating virtual memory ranges that are backed by different physical pages depending on the thread. That way, accessing the flag would be the same cost as any other memory access, since the MMU takes care of the mapping.

Why do none of the mainstream compilers take advantage of page table mappings in this way?

I suppose I can implement my own "thread-specific page" with mmap on Linux and VirtualAlloc on Win32, but this seems like a pretty common use-case. If anyone knows of existing or better solutions, please point me to them.

I've also considered storing an std::atomic<std::thread::id> inside each object to represent the active thread, but profiling shows that the check for std::this_thread::get_id() == active_thread is quite expensive.

troniacl asked Oct 18 '14 08:10



2 Answers

On Linux/x86-64, thread-local storage is implemented through a special segment register, %fs (per x86-64 ABI page 23...)

So the following code (I'm using C with the GCC __thread extension, but it behaves the same as C++11 thread_local here)

__thread int x;
int f(void) { return x; }

is compiled (with gcc -O -fverbose-asm -S) into:

         .text
 .Ltext0:
         .globl  f
         .type   f, @function
 f:
 .LFB0:
         .file 1 "tl.c"
         .loc 1 3 0
         .cfi_startproc
         .loc 1 3 0
         movl    %fs:x@tpoff, %eax       # x,
         ret
         .cfi_endproc
 .LFE0:
         .size   f, .-f
         .globl  x
         .section        .tbss,"awT",@nobits
         .align 4
         .type   x, @object
         .size   x, 4
 x:
         .zero   4

Therefore, contrary to your fears, access to TLS is really quick on Linux/x86-64: the load above is a single instruction. It is not exactly implemented as a table. Instead, the kernel and runtime set the %fs segment register to point to a thread-specific memory zone, and the compiler and linker manage the offset within it. The old pthread_getspecific did indeed go through a table, but it is nearly useless once you have TLS.
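For the per-thread boolean flag in the question, a minimal C++11 sketch could look like the following. The names in_critical_path and set_flag_on_worker are made up for illustration; the only point is that each thread reads and writes its own copy with an ordinary-looking access.

```cpp
#include <thread>

// Hypothetical per-thread flag: every thread gets its own copy,
// initialized to false when the thread starts.
thread_local bool in_critical_path = false;

// Sets the flag on a worker thread and returns what that worker observed.
// The calling thread's copy of the flag is not touched.
bool set_flag_on_worker() {
    bool seen_by_worker = false;
    std::thread worker([&seen_by_worker] {
        in_critical_path = true;            // writes the worker's copy only
        seen_by_worker = in_critical_path;  // reads the worker's copy
    });
    worker.join();
    return seen_by_worker;
}
```

After calling set_flag_on_worker() from the main thread, the main thread's in_critical_path should still be false. On Linux this typically needs to be built with -pthread.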

BTW, by definition, all threads in the same process share the same virtual address space, since a process has a single address space of its own (see /proc/self/maps etc...; see proc(5) for more about /proc/, and also mmap(2). On Linux, the C++11 thread library is based on pthreads, which are implemented using clone(2)). So "thread-specific memory mapping" is a contradiction: once a task (the thing which is run by the kernel scheduler) has its own address space, it is called a process, not a thread. The defining characteristic of threads in the same process is that they share a common address space (and some other entities, like file descriptors).
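To make the shared address space concrete, here is a small sketch (the function name write_from_other_thread is hypothetical): a second thread writes through a plain pointer into a variable living on the first thread's stack, which only works because both threads see the same mappings.

```cpp
#include <thread>

// A worker thread writes through a raw pointer to a local variable that
// lives on the caller's stack. The same address is valid in both threads.
int write_from_other_thread() {
    int on_callers_stack = 0;
    int* p = &on_callers_stack;
    std::thread worker([p] { *p = 42; });  // same address, other thread
    worker.join();                         // join orders the write before the read
    return on_callers_stack;
}
```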

Basile Starynkevitch answered Sep 22 '22 10:09


The suggestion doesn't work, because it would prevent other threads from accessing your thread_local variables via a pointer. Those threads would end up accessing their own copy of that variable.

Say, for example, that you have a main thread and 100 worker threads. The worker threads each pass a pointer to their own thread_local variable back to the main thread. The main thread now has 100 pointers to those 100 variables. If the TLS memory were page-table mapped as suggested, the main thread would have 100 identical pointers to a single, uninitialized variable in the TLS of the main thread - certainly not what was intended!
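A sketch of that scenario with the current implementations, using made-up names (tls_slot, count_reachable_slots): each worker publishes the address of its own thread_local variable, and the calling thread reads every worker's value through those pointers. The workers are kept alive while the pointers are in use, because a thread's TLS goes away when the thread exits.

```cpp
#include <atomic>
#include <cstddef>
#include <set>
#include <thread>
#include <vector>

// Hypothetical per-thread variable; every live thread has its own instance.
thread_local int tls_slot = 0;

// Starts n workers that each publish a pointer to their own tls_slot, then
// returns how many distinct instances the calling thread can reach
// (its own plus one per worker).
std::size_t count_reachable_slots(int n) {
    std::vector<int*> published(n, nullptr);
    std::atomic<int> ready{0};
    std::atomic<bool> release{false};

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&, i] {
            tls_slot = i + 1;           // tag this worker's copy
            published[i] = &tls_slot;   // hand its address to the caller
            ready.fetch_add(1);
            // Stay alive so the storage behind the pointer remains valid.
            while (!release.load()) std::this_thread::yield();
        });
    }
    while (ready.load() < n) std::this_thread::yield();

    std::set<int*> distinct{&tls_slot};  // the caller's own copy
    for (int i = 0; i < n; ++i) {
        // Reading through the pointer yields the worker's value, not ours.
        if (*published[i] == i + 1) distinct.insert(published[i]);
    }

    release.store(true);
    for (auto& w : workers) w.join();
    return distinct.size();
}
```

With n workers the function should report n + 1 distinct variables. Under the page-table scheme from the question, every published pointer would instead resolve to the caller's own copy.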

MSalters answered Sep 22 '22 10:09
