 

Why is thread local storage so slow?

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?

Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.

asked Feb 03 '09 by dsimcha


2 Answers

The speed depends on the TLS implementation.

Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.

The pointer-lookup approach needs help from the scheduler, though: on every task switch, the scheduler must update the pointer to the current task's TLS data.

Another fast way to implement TLS is via the memory management unit. Here TLS data is treated like any other data, except that TLS variables are allocated in a special segment; on each task switch the scheduler maps the correct chunk of memory into the task's address space.

If the scheduler does not support any of these methods, the compiler/library has to do the following:

  • Get the current thread ID
  • Take a semaphore
  • Look up the pointer to the TLS block by thread ID (e.g. in a map)
  • Release the semaphore
  • Return that pointer

Obviously, doing all of this on every TLS access takes a while and may require up to three OS calls: getting the thread ID, and taking and releasing the semaphore.

The semaphore is, by the way, required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and therefore allocating a new TLS block and modifying the data structure).

Unfortunately it's not uncommon to see the slow TLS implementation in practice.

answered Oct 09 '22 by Nils Pipenbrinck


Thread locals in D are really fast. Here are my tests.

64-bit Ubuntu, Core i5, dmd v2.052. Compiler options: dmd -O -release -inline -m64

```d
// this loop takes 0m0.630s
void main(){
    int a; // register allocated
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}
```

So we lose only about 1.2 seconds of one CPU core per 1000*1000*1000 thread-local accesses. Thread locals are accessed through the %fs segment register, so only a couple of extra instructions are involved:

Disassembling with objdump -d:

This is the local variable in the %ecx register (loop counter in %eax):

```
   8:   31 c9                   xor    %ecx,%ecx
   a:   b8 00 ca 9a 3b          mov    $0x3b9aca00,%eax
   f:   83 c1 09                add    $0x9,%ecx
  12:   ff c8                   dec    %eax
  14:   85 c0                   test   %eax,%eax
  16:   75 f7                   jne    f <_Dmain+0xf>
```

This is the thread local; the %fs register is used for the indirection, %edx is the loop counter:

```
   6:   ba 00 ca 9a 3b          mov    $0x3b9aca00,%edx
   b:   64 48 8b 04 25 00 00    mov    %fs:0x0,%rax
  12:   00 00
  14:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # 1b <_Dmain+0x1b>
  1b:   83 04 08 09             addl   $0x9,(%rax,%rcx,1)
  1f:   ff ca                   dec    %edx
  21:   85 d2                   test   %edx,%edx
  23:   75 e6                   jne    b <_Dmain+0xb>
```

Maybe the compiler could be even cleverer and cache the thread local in a register before the loop, writing it back to the thread local at the end (it would be interesting to compare with the gdc compiler), but even now things look very good IMHO.

answered Oct 09 '22 by Andriy