I am using a third-party library which relies on thread_local. This results in my program calling __tls_init() repeatedly, even in each iteration of some loops (I haven't checked all of them), even though the thread_local variables have already been unconditionally initialized by an earlier call within the same function (and in fact, near the start of the whole program).
The first instructions in __tls_init() on my x86_64 are
cmpb $0, %fs:__tls_guard@tpoff
je .L530
ret
.L530:
pushq %rbp
pushq %rbx
subq (some stack space), %rsp
movb $1, %fs:__tls_guard@tpoff
so the first time this is called in each thread, the value at %fs:__tls_guard@tpoff is set to 1 and further calls return immediately. But still, this means paying the full overhead of a call every time a thread_local variable is about to be accessed, right?
Note that this is a statically linked (in fact, generated!) function, so the compiler "knows" it begins with this condition, and it is perfectly conceivable that flow analysis could find that the function need not be called more than once. But it doesn't.
Is it possible to get rid of the superfluous call __tls_init instructions, or at least to stop the compiler from emitting them in time-critical sections?
Example from an actual compilation (-O3):
pushq %r13
movq %rdi, %r13
pushq %r12
pushq %rbp
pushq %rbx
movq %rsi, %rbx
subq $88, %rsp
call __tls_init // always gets called
movq (%rbx), %rdi
call <some local function>
movq 8(%rax), %r12
subq (%rax), %r12
movq %rax, %rbp
sarq $4, %r12
cmpq $1, %r12
jbe .L6512
leaq -2(%r12), %rax
movq $0, (%rsp)
leaq 48(%rsp), %rbx
movq %rax, 8(%rsp)
.L6506:
call __tls_init // needless and called potentially very many times!
movq %rsp, %rsi
movq %rsp, %rdi
addq $8, %rbx
call <some other local function>
movq %rax, -8(%rbx)
leaq 80(%rsp), %rax
cmpq %rbx, %rax
jne .L6506 // cycle
Update: the source code of the above is overly complicated. Here's an MWE:
void external(int);

struct X {
    volatile int a;            // to prevent optimizing to a constexpr
    X() { a = 5; }             // to enforce calling a c-tor for thread_local
    void f() { external(a); }  // to prevent disregarding the value of a
};

thread_local X x;

void f() {
    x.f();
    for (int j = 0; j < 10; j++)
        x.f();                 // x is totally initialized now
}
If you look at this compiled with maximum optimization settings in the Compiler Explorer (link to this particular example), you'll notice the same phenomenon: fs:__tls_guard@tpoff is checked against 0 redundantly in every repetition of the loop after a 1 has been stored there, namely at label .L4 (assuming the output stays the same), even though __tls_init is inlined in this super-simple case.
Although this question is about G++, Clang (see it in Compiler Explorer) makes this even more obvious.
One could say that the external function call could overwrite the stored value in this example. But then what would be guaranteed? If that were allowed, it could also violate calling conventions. In these respects the compiler just has to assume it will play nice. Besides, there were no external functions in my main code above, which is a single translation unit, just a rather large one (it turns out that in small examples like the MWE the compiler will detect and remove the extraneous tests, showing that it must be possible somehow).
I don't know if there is any compiler option to eliminate the TLS call, but your specific code could be optimized by taking a pointer to the TLS object once in the function and using that pointer afterwards:
void f() {
    auto ptr = &x;
    ptr->f();
    for (int j = 0; j < 10; j++)
        ptr->f();
}