Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The Cost of thread_local

Now that C++ is adding thread_local storage as a language feature, I'm wondering a few things:

  1. What is the cost of thead_local likely to be?
    • In memory?
    • For read and write operations?
  2. Associated with that: how do Operating Systems usually implement this? It would seem like anything declared thread_local would have to be given thread-specific storage space for each thread created.
like image 536
John Avatar asked Dec 13 '11 13:12

John


2 Answers

Storage space: size of the variable * number of threads, or possibly (sizeof(var) + sizeof(var*)) * number of threads.

There are two basic ways of implementing thread-local storage:

  1. Using some sort of system call that gets information about the current kernel thread. Sloooow.

  2. Using some pointer, probably in a processor register, that is set properly at every thread context switch by the kernel - at the same time as all the other registers. Cheap.

On intel platforms, variant 2 is usually implemented via some segment register (FS or GS, I don't remember). Both GCC and MSVC support this. Access times are therefore about as fast as for global variables.

It is also possible, but I haven't seen it yet in practice, for this to be implemented via existing library functions like pthread_getspecific. Performance would then be like 1. or 2., plus library call overhead. Keep in mind that variant 2. + library call overhead is still a lot faster than a kernel call.

like image 174
wolfgang Avatar answered Sep 19 '22 14:09

wolfgang


A description for how it works on Linux by Uli Drepper (maintainer of glibc) can be found here: www.akkadia.org/drepper/tls.pdf

The requirement to handle dynamically loaded modules etc. make the entire mechanism a bit convoluted, which perhaps partly explains why the document weights in at 79 pages (!).

Memory-usage-wise, each per-thread variable obviously needs it's own per-thread memory (although in some cases this can be done lazily such that the space is allocated only once the variable is first accessed), and then there's some extra datastructures that are needed for offset tables etc.

Performance-wise, the extra cost to access a TLS variable mostly revolves around retrieving the address of the variable. On x86 Linux the GS register is used as a start to get a thread id, on x86-64 FS. Usually there is a few pointer dereferences, and a function call (__tls_get_addr) for dynamically loaded code. There's also the cost that creating a new thread is slower because the implementation needs to allocate space and possibly initialize all the TLS vars (if not done lazily).

TLS is nice for easily making some old thread-unsafe code patterns thread-safe (think errno), but for new code designed from the start for a multi-threaded world it's very seldom needed.

like image 41
janneb Avatar answered Sep 22 '22 14:09

janneb