Thread local real usage of the underlying segment registers

I have read a number of articles and Stack Overflow answers saying that (on Linux x86_64) FS (or GS in some variants) references a thread-specific page table entry, which then gives an array of pointers to the actual data, which lives in shareable memory. When threads are switched, all the registers are swapped, and the thread's base page therefore changes. Thread-local variables are accessed by name with just one extra pointer hop, and the referenced values can be shared with other threads. All good and plausible.

Indeed, if you look at the code for __errno_location(void), the function behind errno, you find something like this (from musl, but glibc is not much different):

static inline struct pthread *__pthread_self()
{
    struct pthread *self;
    __asm__ __volatile__ ("mov %%fs:0,%0" : "=r" (self) );
    return self;
}

And from glibc:

=> 0x7ffff6efb4c0 <__errno_location>:   endbr64
   0x7ffff6efb4c4 <__errno_location+4>: mov    0x6add(%rip),%rax        # 0x7ffff6f01fa8
   0x7ffff6efb4cb <__errno_location+11>:        add    %fs:0x0,%rax
   0x7ffff6efb4d4 <__errno_location+20>:        retq

So my expectation is that the actual value of FS would change for each thread. E.g. under the debugger (gdb: info reg or p $fs), I would expect to see a different value of FS in different threads, but no: ds, es, fs, gs are all zero all the time.

In my own code, I write something like the below and get the same result: FS is unchanged, but the thread-local variable "works":

#include <ostream>  // std::ostream, used by operator<< below

struct Segregs
{
    unsigned short int  cs, ss, ds, es, fs, gs;
    friend std::ostream& operator << (std::ostream& str, const Segregs& sr)
    {
        str << "[cs:" << sr.cs << ",ss:" << sr.ss << ",ds:" << sr.ds
            << ",es:" << sr.es << ",fs:" << sr.fs << ",gs:" << sr.gs << "]";
        return str;
    }
};

Segregs GetSegRegs()
{
    unsigned short int  r_cs, r_ss, r_ds, r_es, r_fs, r_gs;
    __asm__ __volatile__ ("mov %%cs,%0" : "=r" (r_cs) );
    __asm__ __volatile__ ("mov %%ss,%0" : "=r" (r_ss) );
    __asm__ __volatile__ ("mov %%ds,%0" : "=r" (r_ds) );
    __asm__ __volatile__ ("mov %%es,%0" : "=r" (r_es) );
    __asm__ __volatile__ ("mov %%fs,%0" : "=r" (r_fs) );
    __asm__ __volatile__ ("mov %%gs,%0" : "=r" (r_gs) );
    return {r_cs, r_ss, r_ds, r_es, r_fs, r_gs};
}

But the output?

Main: Seg regs : [cs:51,ss:43,ds:0,es:0,fs:0,gs:0]
Main:    tls    @0x7ffff699307c=0
Main:    static @0x96996c=0
 Modified to 1234
Main:    tls    @0x7ffff699307c=1234
Main:    static @0x96996c=1234

 Async thread
[New Thread 0x7ffff695e700 (LWP 3335119)]
Thread: Seg regs : [cs:51,ss:43,ds:0,es:0,fs:0,gs:0]
Thread:  tls    @0x7ffff695e6fc=0
Thread:  static @0x96996c=1234

So something else is actually going on? What extra trickery is happening, and why add the complication?

For context I'm trying to do something "funky with forks", so I would like to know the gory detail.

What is thread local storage used for?

Thread Local Storage (TLS) is the mechanism by which each thread in a given multithreaded process allocates storage for thread-specific data. In standard multithreaded programs, data is shared among all threads of a given process, whereas thread local storage is the mechanism for allocating per-thread data.

What is the use of segment registers?

On the original 8086 (and in x86 real mode generally), a segment register selects a 64 KiB window within the 20-bit address space. Its 16-bit value is shifted left by 4 bits (that is, multiplied by 16) and added to the addressing register's 16-bit offset to produce the actual 20-bit memory address.

What are the names of the 4 segment registers?

The 8086 has four special segment registers: cs, ds, es, and ss. These stand for Code Segment, Data Segment, Extra Segment, and Stack Segment, respectively. These registers are all 16 bits wide. They select blocks (segments) of main memory.

What is a thread local in C?

Thread-local storage (TLS) is a mechanism by which variables are allocated such that there is one instance of the variable per extant thread. The runtime model GCC uses to implement this originates in the IA-64 processor-specific ABI, but has since been migrated to other processors as well.

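In 64-bit mode, the actual contents of the 16-bit FS and GS segment registers are normally the "null selector" (0), because other mechanisms are used to set the segment bases with 64-bit values: either a privileged write to the FS/GS base MSRs (IA32_FS_BASE and IA32_GS_BASE), or the wrfsbase and wrgsbase instructions.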

Like in protected mode, there are separate "FSBASE" and "GSBASE" registers within the CPU, and when you specify, say, an FS segment override to an instruction, the base address from the FSBASE register is added to the operand's effective address to determine the actual linear address to be accessed.

The kernel's context structure for each thread stores a copy of its FSBASE and GSBASE registers, and they are reloaded appropriately on each context switch.

So what actually happens is that each thread sets its FSBASE register to point to its own thread-local storage. (Depending on the CPU features and OS design, this may only be possible for privileged code, so a system call may be required.) Then instructions with an FS segment override can be used to access an object with a given offset in the thread-local storage block, as you've seen.

In 32-bit mode, on the other hand, the values in FS and GS do have more meaning; they are segment selectors which are used to index into a descriptor table maintained by the kernel. The descriptor table holds the actual segment info, including its base address, and you could use a system call to ask the kernel to modify it. Each thread would have its own local descriptor table, so you wouldn't necessarily see different selectors in FS for different threads, but it would still be the case that FS-override instructions from different threads would result in accesses to different linear addresses.

(Or a 32-bit kernel could write into a GDT entry and mov a constant from a register into fs or gs to get it to reload that newly-written GDT entry. So it would only need a GDT per logical core instead of an LDT per process. The CPU never reloads a segment descriptor on its own, although with a per-core GDT the entry would still match the current task if you had separate entries for FS and GS. So user-space might not break itself with mov eax,gs / mov gs,eax.)

Anyway, this is really just the lack of a convenient MSR or wrfsbase way to set segment register bases separately from mov Sreg, r/m triggering the CPU to load the descriptor. In either protected or long mode, the segment register value does need to be valid (including null = 0), and moving some random value into it will likely fault.

Tags:

c++

linux

x86-64

thread-local-storage

memory-segmentation

Gem Taylor

Nate Eldredge
