
How to reduce or eliminate __tls_init calls?

I am using a third-party library that relies on thread_local. As a result, my program calls __tls_init() repeatedly, in some loops on every iteration (I haven't checked all of them), even though the thread_local variables have already been unconditionally initialized by an earlier call within the same function (and in fact near the start of the whole program).

The first instructions in __tls_init() on my x86_64 are

cmpb    $0, %fs:__tls_guard@tpoff
je      .L530
ret
.L530:
pushq   %rbp
pushq   %rbx
subq    (some stack space), %rsp
movb    $1, %fs:__tls_guard@tpoff

So the first time this is called in each thread, the byte at %fs:__tls_guard@tpoff is set to 1, and further calls return immediately. But this still means paying the overhead of a call every time a thread_local variable is about to be accessed, right?
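For readers who prefer C++ to assembly, the guard logic can be emulated by hand. This is only a rough sketch of the idea, not what the compiler literally generates: tls_guard, init_runs, tls_init_emulated and access_count are hypothetical names invented for this illustration.

```cpp
// Hand-written emulation of the per-thread guard; all names are hypothetical.
#include <cassert>

static thread_local bool tls_guard = false;  // plays the role of __tls_guard
static thread_local int  init_runs = 0;      // counts how often the slow path runs

// Plays the role of __tls_init: cheap check first, real work once per thread.
inline void tls_init_emulated() {
    if (tls_guard) return;  // fast path, like "cmpb $0, guard" followed by "ret"
    tls_guard = true;       // like "movb $1, guard"
    ++init_runs;            // stands in for running the constructors
}

// Every access goes through the check again, as in the generated code.
inline int access_count(int accesses) {
    for (int i = 0; i < accesses; ++i)
        tls_init_emulated();
    return init_runs;
}
```

Calling access_count(1000) runs the slow path only once in a given thread, but the guard test itself is still executed on every one of the 1000 accesses, which is the overhead in question.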

Note that this is a statically linked (in fact, compiler-generated!) function, so the compiler "knows" it begins with this condition, and it is perfectly conceivable that flow analysis could determine that the function need not be called more than once. But it doesn't.

Is it possible to get rid of the superfluous "call __tls_init" instructions, or at least to stop the compiler from emitting them in time-critical sections?

Example from an actual compilation (-O3):

pushq   %r13
movq    %rdi, %r13
pushq   %r12
pushq   %rbp
pushq   %rbx
movq    %rsi, %rbx
subq    $88, %rsp
call    __tls_init              // always gets called
movq    (%rbx), %rdi
call    <some local function>
movq    8(%rax), %r12
subq    (%rax), %r12
movq    %rax, %rbp
sarq    $4, %r12
cmpq    $1, %r12
jbe     .L6512
leaq    -2(%r12), %rax
movq    $0, (%rsp)
leaq    48(%rsp), %rbx
movq    %rax, 8(%rsp)
.L6506:
call    __tls_init              // needless and called potentially very many times!
movq    %rsp, %rsi
movq    %rsp, %rdi
addq    $8, %rbx
call    <some other local function>
movq    %rax, -8(%rbx)
leaq    80(%rsp), %rax
cmpq    %rbx, %rax
jne     .L6506                  // loop

Update: the source code of the above is overly complicated. Here's an MWE:

void external(int);

struct X {
  volatile int a;   // to prevent optimizing to a constexpr
  X() { a = 5; }    // to enforce calling a c-tor for thread_local
  void f() { external(a); } // to prevent disregarding the value of a
};

thread_local X x;

void f() {
  x.f();
  for(int j = 0; j < 10; j++)
    x.f();  // x is totally initialized now
}

If you look at this compiled with maximum optimization settings in the Compiler Explorer (link to this particular example), you'll notice the same phenomenon: fs:__tls_guard@tpoff is redundantly compared against 0 in every iteration of the loop, after a 1 has already been stored there. This happens at label .L4 (assuming the output stays the same), even though __tls_init is inlined in this super-simple case.

Although this question is about G++, Clang (see in Compiler Explorer) makes this even more obvious.

One could say that the external function call could overwrite the stored value in this example. But then what would be guaranteed at all? By that reasoning it could also violate the calling conventions. In these respects the compiler just has to assume the callee plays nice. Besides, my main code above had no external functions and consisted of a single translation unit, just a rather large one. (It turns out that in small examples like the MWE the compiler will detect and remove the extraneous tests, which shows that it must be possible somehow.)

asked Oct 27 '16 13:10 by The Vee

1 Answer

I don't know of a compiler option that eliminates the __tls_init call, but your specific code can be optimized by taking a pointer to the TLS object once in the function:

void f() {
  auto ptr = &x;
  ptr->f();
  for(int j = 0; j < 10; j++)
    ptr->f(); 
}
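A self-contained variant of the same idea, for experimenting, might look like the following. It is a sketch under stated assumptions: the question leaves external() undefined, so a stand-in that sums its arguments is supplied here, and the names calls and run_with_cached_pointer are hypothetical. Whether the repeated guard check actually disappears depends on the compiler version and flags, so it is worth inspecting the generated assembly.

```cpp
#include <cassert>

static int calls = 0;                 // hypothetical: lets us observe the calls
void external(int v) { calls += v; }  // stand-in for the undefined external()

struct X {
    volatile int a;            // as in the question: not a constant expression
    X() { a = 5; }             // forces dynamic initialization of the TLS object
    void f() { external(a); }
};

thread_local X x;

// Hypothetical wrapper around the workaround: take the address once, reuse it.
int run_with_cached_pointer() {
    X* ptr = &x;               // single TLS access; initialization is checked here
    ptr->f();
    for (int j = 0; j < 10; j++)
        ptr->f();              // ordinary pointer dereference inside the loop
    return calls;
}
```

Note that the cached pointer refers to the current thread's instance of x only, so it should not be handed to another thread as if it were that thread's copy.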
answered Sep 29 '22 00:09 by tristan
