> then, how is it possible for user level threads to have individual stacks?

There is nothing "magic" about a processor stack. I'll confine discussion [mostly] to x86, but this would be applicable to any architecture, even those that don't have a stack register at all (e.g. 1970s-era IBM mainframes, such as the IBM System/370).

Under x86, the stack pointer is `%rsp`. The x86 has `push` and `pop` instructions. We use these to save and restore things: `push %rcx` and [later] `pop %rcx`.

But suppose the x86 did not have `%rsp` or `push`/`pop` instructions. Could we still have a stack? Sure, by convention. We [as programmers] agree that (e.g.) `%rbx` is the stack pointer.

In that case, a "push" of `%rcx` would be [using AT&T assembler]:

```
subq    $8,%rbx
movq    %rcx,0(%rbx)
```

And a "pop" of `%rcx` would be:

```
movq    0(%rbx),%rcx
addq    $8,%rbx
```

To make it easier, I'm going to switch to C "pseudo code". Here are the above push/pop in pseudo code:

```
// push %rcx
    %rbx -= 8;
    0(%rbx) = %rcx;

// pop %rcx
    %rcx = 0(%rbx);
    %rbx += 8;
```

To create a thread, the LWP scheduler had to create a stack area using `malloc`. It then had to save this pointer in a per-thread struct, and then kick off the child LWP. The actual code is a bit tricky, so assume we have (e.g.) an `LWP_create` function that is similar to `pthread_create`:

```
typedef void * (*LWP_func)(void *);

// per-thread control
typedef struct tsk tsk_t;
struct tsk {
    tsk_t *tsk_next;                    // next task in list
    tsk_t *tsk_prev;                    // previous task in list
    void *tsk_stack;                    // stack base (lowest address)
    u64 tsk_regsave[16];                // saved registers
};

// list of tasks
typedef struct tsklist tsklist_t;
struct tsklist {
    tsk_t *tsk_next;                    // head of list
    tsk_t *tsk_prev;                    // tail of list
};

tsklist_t tsklist;                      // list of tasks

tsk_t *tskcur;                          // current thread

// LWP_switch -- switch from one task to another
void
LWP_switch(tsk_t *to)
{

    // NOTE: we use (i.e.) burn register values as we do our work. in a real
    // implementation, we'd have to push/pop these in a special way. so, just
    // pretend that we do that ...

    // save all registers into tskcur->tsk_regsave
    tskcur->tsk_regsave[RAX] = %rax;
    // ...

    tskcur = to;

    // restore most registers from tskcur->tsk_regsave
    %rax = tskcur->tsk_regsave[RAX];
    // ...

    // set stack pointer to new task's stack
    %rsp = tskcur->tsk_regsave[RSP];

    // set resume address for task
    push(%rsp,tskcur->tsk_regsave[RIP]);

    // issue "ret" instruction
    ret();
}

// LWP_create -- start a new LWP
tsk_t *
LWP_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack. x86 stacks grow downward, so the initial
    // stack pointer is the TOP of the area, not its base
    tsknew->tsk_stack = malloc(0x100000);
    tsknew->tsk_regsave[RSP] = (u64) tsknew->tsk_stack + 0x100000;

    // resume address: where the task starts executing
    tsknew->tsk_regsave[RIP] = (u64) start_routine;

    // give task its argument
    tsknew->tsk_regsave[RDI] = (u64) arg;

    // switch to new task
    LWP_switch(tsknew);

    return tsknew;
}

// LWP_destroy -- destroy an LWP
void
LWP_destroy(tsk_t *tsk)
{

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
```

With a kernel that understands threads, we use `pthread_create` and `clone`, but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The `clone` syscall accepts a `child_stack` argument. Thus, `pthread_create` must allocate a stack for the new thread and pass that to `clone`:

```
// pthread_create -- start a new native thread
// (simplified pseudo code, not the real glibc signature)
tsk_t *
pthread_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000);

    // start up thread. clone wants the TOP of the stack area, and
    // CLONE_THREAD is only accepted together with CLONE_SIGHAND and CLONE_VM
    clone(start_routine,(char *) tsknew->tsk_stack + 0x100000,
        CLONE_VM | CLONE_SIGHAND | CLONE_THREAD,arg);

    return tsknew;
}

// pthread_join -- wait for a native thread and release its resources
void
pthread_join(tsk_t *tsk)
{

    // wait for thread to die ...

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
```

Only a process or main thread is assigned its initial stack by the kernel, usually at a high memory address. So, if the process does not use threads, it normally just uses that pre-assigned stack.

But if a thread is created, either an LWP or a native one, the starting process/thread must pre-allocate the stack area for the proposed thread, e.g. with `malloc` (glibc itself uses `mmap` for this, but the idea is the same).

Side note: Using `malloc` is the normal way, but the thread creator could just have a large pool of global memory: `char stack_area[MAXTASK][0x100000];` if it wished to do it that way.

If we had an ordinary program that does not use threads [of any type], it may wish to "override" the default stack it has been given. That process could decide to use `malloc` and the above assembler trickery to create a much larger stack if it were running a hugely recursive function.

See my answer here: What is the difference between user defined stack and built in stack in use of memory?
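To see the "the creator supplies the stack" idea with the real pthreads API, here is a minimal sketch using `pthread_attr_setstack`. The names `run_on_custom_stack`, `probe_thread` and `struct probe` are made up for this illustration, and the 4096-byte alignment is an assumption about the page size rather than anything the text above requires. The new thread simply checks whether one of its local variables lives inside the buffer we handed over.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE (1u << 20)   /* 1 MiB, i.e. 0x100000 */

struct probe {
    char *stack_base;    /* lowest address of the caller-provided stack */
    int   on_our_stack;  /* set by the thread: 1 if its locals live there */
};

/* Thread body: check that a local variable sits inside our buffer. */
static void *probe_thread(void *arg)
{
    struct probe *p = arg;
    char local;
    uintptr_t addr = (uintptr_t)&local;
    uintptr_t lo = (uintptr_t)p->stack_base;

    p->on_our_stack = (addr >= lo && addr < lo + STACK_SIZE);
    return NULL;
}

/* Returns 1 if the new thread ran on the stack we allocated, 0 if it
 * did not, and -1 on any setup error. */
int run_on_custom_stack(void)
{
    struct probe p = { NULL, 0 };
    pthread_attr_t attr;
    pthread_t tid;
    void *mem = NULL;
    int rc = -1;

    /* Assumed page size of 4096; aligned memory keeps setstack happy. */
    if (posix_memalign(&mem, 4096, STACK_SIZE) != 0)
        return -1;
    p.stack_base = mem;

    if (pthread_attr_init(&attr) != 0) {
        free(mem);
        return -1;
    }
    /* We pass the LOWEST address plus the size; the library works out
     * where the initial stack pointer goes. */
    if (pthread_attr_setstack(&attr, mem, STACK_SIZE) == 0 &&
        pthread_create(&tid, &attr, probe_thread, &p) == 0) {
        pthread_join(tid, NULL);
        rc = p.on_our_stack;
    }
    pthread_attr_destroy(&attr);
    free(mem);   /* safe: the thread has been joined */
    return rc;
}
```

Calling `run_on_custom_stack()` from `main` (linked with `-pthread`) should report that the thread's locals landed in the buffer. Note that the buffer is only freed after `pthread_join`, mirroring the pseudo code above.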


How are user-level threads scheduled/created, and how are kernel level threads created?

Tags:

c++

c

linux

multithreading

linux-kernel

Apologies if this question is stupid. I tried to find an answer online for quite some time, but couldn't, and hence I'm asking here. I am learning threads, and I've been going through this link and this Linux Plumbers Conference 2013 video about kernel-level and user-level threads. As far as I understood, using pthreads creates threads in userspace, and the kernel is not aware of this and views it as a single process only, unaware of how many threads are inside. In such a case,

- Who decides the scheduling of these user threads during the timeslice the process gets, given that the kernel sees it as a single process and is unaware of the threads, and how is the scheduling done?
- If pthreads creates user-level threads, how are kernel-level or OS threads created from userspace programs, if required?
- According to the above link, the operating system kernel provides system calls to create and manage threads. So does a `clone()` system call create a kernel-level thread or a user-level thread?
  - If it creates a kernel-level thread, then `strace` of a simple pthreads program also shows `clone()` being used while executing, so why would that be considered a user-level thread?
  - If it doesn't create a kernel-level thread, then how are kernel threads created from userspace programs?
- According to the link, "It require a full thread control block (TCB) for each thread to maintain information about threads. As a result there is significant overhead and increased in kernel complexity." So in kernel-level threads, is only the heap shared, with everything else individual to the thread?

Edit:

I was asking about user-level thread creation and its scheduling because here, there is a reference to the Many-to-One model, where many user-level threads are mapped to one kernel-level thread and thread management is done in user space by the thread library. I've only been seeing references to using pthreads, but I'm unsure whether it creates user-level or kernel-level threads.


asked Aug 27 '16 19:08

init

2 Answers

This is prefaced by the top comments.

The documentation you're reading is generic [not Linux specific] and a bit outdated. And, more to the point, it is using different terminology. That is, I believe, the source of the confusion. So, read on ...

What it calls a "user-level" thread is what I'm calling an [outdated] LWP thread. What it calls a "kernel-level" thread is what is called a native thread in Linux. Under Linux, what is called a "kernel" thread is something else altogether [see below].

> using pthreads create threads in the userspace, and the kernel is not aware about this and view it as a single process only, unaware of how many threads are inside.

This was how userspace threads were done prior to the NPTL (Native POSIX Thread Library). This is also what SunOS/Solaris called an LWP ("lightweight process").

There was one process that multiplexed itself and created threads. IIRC, it was called the thread master process [or some such]. The kernel was not aware of this. The kernel didn't yet understand or provide support for threads.

But because these "lightweight" threads were switched by code in the userspace-based thread master (aka "lightweight process scheduler") [just a special user program/process], they were very slow to switch context.

Also, before the advent of "native" threads, you might have 10 processes. Each process gets 10% of the CPU. If one of the processes was an LWP that had 10 threads, these threads had to share that 10% and, thus, got only 1% of the CPU each.

All this was replaced by the "native" threads that the kernel's scheduler is aware of. This changeover happened in the early 2000s, when NPTL arrived alongside the Linux 2.6 kernel.

Now, with the above example, the kernel schedules 19 entities (the 9 ordinary processes plus the 10 threads), and each gets roughly 5% of the CPU. And the context switch is much faster.

It is still possible to have an LWP system under a native thread, but, now, that is a design choice, rather than a necessity.

Further, LWP works great if each thread "cooperates". That is, each thread's loop periodically makes an explicit call to a "context switch" function, voluntarily relinquishing the process slot so another LWP can run.

However, the pre-NPTL implementation in `glibc` also had to [forcibly] preempt LWP threads (i.e. implement timeslicing). I can't remember the exact mechanism used, but here's an example of how it could work: the thread master had to set an alarm, go to sleep, wake up, and then send the active thread a signal. The signal handler would effect the context switch. This was messy, ugly, and somewhat unreliable.

> Joachim mentioned `pthread_create` function creates a kernel thread

It is [technically] incorrect to call that a kernel thread. `pthread_create` creates a native thread. This runs in userspace and vies for timeslices on an equal footing with processes. Once created, there is little difference between a thread and a process.

The primary difference is that a process has its own unique address space. A thread, however, is a process that shares its address space with the other processes/threads that are part of the same thread group.
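A quick way to check this on a Linux box is to compare what a thread and its creator see. The following is only a sketch: `thread_shares_process`, `worker` and `struct report` are names invented for this example, and the thread ID is fetched with the raw `SYS_gettid` syscall so it does not depend on a particular glibc version.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct report {
    pid_t pid;   /* process (thread group) ID as seen by the new thread */
    pid_t tid;   /* kernel thread ID of the new thread */
};

static int shared_counter;   /* one copy, visible to every thread */

static void *worker(void *arg)
{
    struct report *r = arg;

    r->pid = getpid();
    r->tid = (pid_t)syscall(SYS_gettid);
    shared_counter = 42;     /* lands in the creator's memory too */
    return NULL;
}

/* Returns 1 when the thread shared our PID and memory but had its own
 * TID, 0 when it did not, and -1 if the thread could not be created. */
int thread_shares_process(void)
{
    struct report r = { 0, 0 };
    pthread_t t;
    pid_t main_tid = (pid_t)syscall(SYS_gettid);

    shared_counter = 0;
    if (pthread_create(&t, NULL, worker, &r) != 0)
        return -1;
    pthread_join(t, NULL);

    return r.pid == getpid()       /* same thread group */
        && r.tid != main_tid       /* but a separately scheduled task */
        && shared_counter == 42;   /* and the write hit shared memory */
}
```

The distinct TID is what the kernel's scheduler works with, which is why each native thread gets its own timeslices.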

> If it doesn't create a kernel level thread, then how are kernel threads created from userspace programs?

Kernel threads are not userspace threads, NPTL, native, or otherwise. They are created by the kernel via the `kernel_thread` function (current kernels usually go through helpers such as `kthread_create`/`kthread_run`). They run as part of the kernel and are not associated with any userspace program/process/thread. They have full access to the machine: devices, MMU, etc. Kernel threads run at the highest privilege level (ring 0 on x86). They also run in the kernel's address space and not the address space of any user process/thread.

A userspace program/process may not create a kernel thread. Remember, it creates a native thread using `pthread_create`, which invokes the `clone` syscall to do so.

Threads are useful for getting things done, even for the kernel. So, it runs some of its code in various threads. You can see these threads by doing `ps ax`, where their names are shown in square brackets. Look and you'll see `kthreadd`, `ksoftirqd`, `kworker`, `rcu_sched`, `rcu_bh`, `watchdog`, `migration`, etc. These are kernel threads and not programs/processes.
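If you'd rather test for a kernel thread from a program than eyeball `ps`, one approach is to read the flags field of `/proc/<pid>/stat` and look for the kernel's `PF_KTHREAD` bit. This is a sketch under assumptions: `is_kernel_thread` is a made-up helper, and the constant is copied by hand from the kernel's `include/linux/sched.h` because it is not exported in userspace headers, so treat it as something to verify against your kernel version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Assumed value of PF_KTHREAD from the kernel's include/linux/sched.h. */
#define PF_KTHREAD_FLAG 0x00200000u

/* Returns 1 if pid is a kernel thread, 0 if it is a user task, and
 * -1 if /proc/<pid>/stat cannot be read or parsed. */
int is_kernel_thread(long pid)
{
    char path[64], buf[1024];
    char state;
    int ppid, pgrp, session, tty_nr, tpgid;
    unsigned int flags;
    FILE *fp;
    char *p;
    size_t n;

    snprintf(path, sizeof path, "/proc/%ld/stat", pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    /* Field 2 (the command name) is wrapped in parentheses and may
     * itself contain spaces or ')', so resume after the LAST ')'. */
    p = strrchr(buf, ')');
    if (p == NULL)
        return -1;

    /* Fields 3..9: state ppid pgrp session tty_nr tpgid flags */
    if (sscanf(p + 1, " %c %d %d %d %d %d %u",
               &state, &ppid, &pgrp, &session, &tty_nr, &tpgid, &flags) != 7)
        return -1;

    return (flags & PF_KTHREAD_FLAG) ? 1 : 0;
}
```

On a typical (non-containerized) Linux system, `is_kernel_thread(2)` would be expected to report `kthreadd` as a kernel thread, while `is_kernel_thread(getpid())` reports an ordinary user task.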

UPDATE:

> You mentioned that kernel doesn't know about user threads.

Remember that, as mentioned above, there are two "eras".

(1) Before the kernel got thread support (circa 2004?). This used the thread master (which, here, I'll call the LWP scheduler). The kernel just had the `fork` syscall.

(2) All kernels after that, which do understand threads. There is no thread master, but we have `pthreads` and the `clone` syscall. Now, `fork` is implemented as `clone`. `clone` is similar to `fork` but takes some arguments. Notably, a `flags` argument and a `child_stack` argument.
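To make those two arguments concrete, here is a small sketch that calls the glibc `clone` wrapper directly. The helper name `clone_and_observe` is invented for this example. It starts a child on a `malloc`'d stack and reports what the parent sees afterwards: with `CLONE_VM` the child shares the parent's memory (thread-like), and without it the child gets its own copy (`fork`-like).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1u << 20)

/* Child body: write through the pointer handed in by the parent. */
static int child_fn(void *arg)
{
    *(int *)arg = 1234;   /* visible to the parent only if memory is shared */
    return 0;
}

/* Runs child_fn via clone() with the given extra flags and returns the
 * value the parent observes afterwards, or -1 on error. */
int clone_and_observe(int extra_flags)
{
    int value = 0;
    int status;
    pid_t pid;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (stack == NULL)
        return -1;

    /* The stack grows downward on x86, so clone() is given the TOP of
     * the area. SIGCHLD makes the child waitable with waitpid(). */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                extra_flags | SIGCHLD, &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);          /* the child has exited, nobody uses it now */
    return value;
}
```

`clone_and_observe(CLONE_VM)` should return 1234 because the child wrote into the parent's address space, whereas `clone_and_observe(0)` should return 0 because the child only modified its private copy. A real `pthread_create` passes many more flags than this; the sketch only isolates the address-space one.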

Craig Estey

User-level threads are usually coroutines, in one form or another. They switch context between flows of execution in user mode, with no kernel involvement. From the kernel's point of view, it is all one thread. What the thread actually does is controlled in user mode, and user mode can suspend, switch, and resume logical flows of execution (i.e. coroutines). It all happens during the quanta scheduled for the actual thread. The kernel can, and will, unceremoniously interrupt the actual thread (kernel thread) and give control of the processor to another thread.

User-mode coroutines require cooperative multitasking. User-mode threads must periodically relinquish control to other user-mode threads (basically, execution changes context to the new user-mode thread without the kernel thread ever noticing anything). Usually the code knows a whole lot better when it wants to release control than the kernel would. A poorly coded coroutine can steal control and starve all other coroutines.

The historical implementation used `setcontext`, but that is now deprecated. Boost.Context offers a replacement for it, but is not fully portable:

> Boost.Context is a foundational library that provides a sort of cooperative multitasking on a single thread. By providing an abstraction of the current execution state in the current thread, including the stack (with local variables) and stack pointer, all registers and CPU flags, and the instruction pointer, a execution_context represents a specific point in the application's execution path.

Not surprisingly, Boost.coroutine is based on Boost.context.

Windows provides Fibers. The .NET runtime has Tasks and async/await.
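For a feel of how such cooperative switching looks in plain C, here is a minimal sketch built on the old `ucontext` family (`getcontext`, `makecontext`, `swapcontext`), which glibc still ships. Everything else here (`run_round_robin`, `task_yield`, `task_body`, the fixed three tasks and 64 KiB stacks) is made up for illustration. The kernel sees a single thread throughout; the "scheduler" is just a loop in user mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <ucontext.h>

#define NTASKS     3
#define STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;             /* the scheduler (caller's flow) */
static ucontext_t task_ctx[NTASKS];      /* one saved context per task */
static char task_stack[NTASKS][STACK_SIZE];
static int  task_done[NTASKS];
static int  current;                     /* index of the running task */

static char   trace[64];                 /* order in which tasks ran */
static size_t trace_len;

/* Voluntarily hand control back to the scheduler. */
static void task_yield(void)
{
    swapcontext(&task_ctx[current], &sched_ctx);
}

/* Each task records its letter twice, yielding after each step. */
static void task_body(int id)
{
    for (int step = 0; step < 2; step++) {
        if (trace_len < sizeof trace - 1)
            trace[trace_len++] = (char)('A' + id);
        task_yield();
    }
    task_done[id] = 1;
    /* returning resumes uc_link, i.e. the scheduler */
}

/* Round-robin over all tasks until each has finished; returns the trace. */
const char *run_round_robin(void)
{
    int remaining = NTASKS;

    trace_len = 0;
    memset(task_done, 0, sizeof task_done);
    for (int i = 0; i < NTASKS; i++) {
        getcontext(&task_ctx[i]);
        task_ctx[i].uc_stack.ss_sp = task_stack[i];
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_link = &sched_ctx;
        makecontext(&task_ctx[i], (void (*)(void))task_body, 1, i);
    }

    while (remaining > 0) {
        for (current = 0; current < NTASKS; current++) {
            if (task_done[current])
                continue;
            swapcontext(&sched_ctx, &task_ctx[current]);
            if (task_done[current])
                remaining--;
        }
    }
    trace[trace_len] = '\0';
    return trace;
}
```

With three tasks that each yield after every step, `run_round_robin()` returns `"ABCABC"`. If one task never called `task_yield`, the others would never run again, which is the starvation problem mentioned above.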

answered Oct 11 '22 06:10

Remus Rusanu
