Having just finished a book on computer architecture, I find myself not completely clear on where the scheduler is running.
What I'm looking to have clarified is where the scheduler runs: does it have its own core assigned to run that and nothing else, or is the "scheduler" in fact just a more ambiguous algorithm that is implemented in every thread being executed, e.g. upon preemption of a thread, a switchToFrom() command is run?
I don't need specifics for Windows/Linux/macOS, just the general picture.
No, the scheduler does not run on its own core. In fact, multi-threading was common long before multi-core CPUs were.
The best way to see how scheduler code interacts with thread code is to start with a simple, cooperative, single-core example.
Suppose thread A is running and thread B is waiting on an event. Thread A posts that event, which causes thread B to become runnable. The event logic has to call the scheduler, and, for the purposes of this example, we assume that it decides to switch to thread B. At this point in time the call stack will look something like this:
thread_A_main()
post_event(...)
scheduler(...)
switch_threads(threadA, threadB)
switch_threads will save the CPU state on the stack, save thread A's stack pointer, and load the CPU stack pointer with the value of thread B's stack pointer. It will then load the rest of the CPU state from the stack, where the stack is now thread B's stack. At this point, the call stack has become:
thread_B_main()
wait_on_event(...)
scheduler(...)
switch_threads(threadB, threadC)
In other words, thread B has now woken up in the state it was in when it previously yielded control to thread C. When switch_threads() returns, it returns control to thread B.
This kind of manipulation of the stack pointer usually requires some hand-coded assembler.
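If you want to experiment with this mechanism without writing assembler, POSIX offers the (obsolescent but still widely available on Linux) ucontext API, which packages up exactly this save-state/swap-stack-pointer/restore-state dance. A minimal cooperative sketch, with names mirroring the example above (an illustration, not how a production kernel is written):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_b;
static char stack_b[64 * 1024];          /* stack for "thread B" */

static void thread_b_main(void) {
    printf("thread B: running\n");
    swapcontext(&ctx_b, &ctx_main);      /* yield: save B, resume main */
    printf("thread B: resumed\n");
}

int main(void) {
    getcontext(&ctx_b);                  /* initialise B's context */
    ctx_b.uc_stack.ss_sp = stack_b;
    ctx_b.uc_stack.ss_size = sizeof stack_b;
    ctx_b.uc_link = &ctx_main;           /* resume main when B returns */
    makecontext(&ctx_b, thread_b_main, 0);

    printf("main: switching to B\n");
    swapcontext(&ctx_main, &ctx_b);      /* the switch_threads() moment */
    printf("main: back, switching to B again\n");
    swapcontext(&ctx_main, &ctx_b);
    printf("main: done\n");
    return 0;
}

swapcontext() here plays the role of switch_threads(): each call saves the caller's state and resumes the other context exactly where it last yielded.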
Add Interrupts
Thread B is running and a timer interrupt occurs. The call stack is now:
thread_B_main()
foo() //something thread B was up to
interrupt_shell
timer_isr()
interrupt_shell is a special function. It is not called; it is preemptively invoked by the hardware. foo() did not call interrupt_shell, so when interrupt_shell returns control to foo(), it must restore the CPU state exactly. This is different from a normal function, which returns leaving the CPU state according to the calling conventions. Since interrupt_shell follows different rules from those stated by the calling conventions, it too must be written in assembler.
The main job of interrupt_shell is to identify the source of the interrupt and call the appropriate interrupt service routine (ISR), in this case timer_isr(). Control is then returned to the running thread.
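The dispatch step is often just a table lookup. A toy sketch in C (the table layout and the names dispatch_interrupt and isr_table are illustrative assumptions, not any real kernel's API):

#include <stdio.h>

typedef void (*isr_t)(void);

static void timer_isr(void) { printf("timer tick\n"); }
static void uart_isr(void)  { printf("uart byte\n"); }

/* one handler per interrupt source; a real table is filled in by drivers */
static isr_t isr_table[] = { timer_isr, uart_isr };

/* interrupt_shell (in assembler) saves the CPU state, reads the
   interrupt source from the hardware, then calls something like this */
static void dispatch_interrupt(unsigned source) {
    isr_table[source]();
}

int main(void) {
    dispatch_interrupt(0);   /* simulate a timer interrupt */
    return 0;
}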
Add preemptive thread switches
Suppose the timer_isr() decides that it's time for a time-slice, and that thread D is to be given some CPU time:
thread_B_main()
foo() //something thread B was up to
interrupt_shell
timer_isr()
scheduler()
Now, scheduler() can't call switch_threads() at this point, because we are in interrupt context. However, it can be called soon after, usually as the last thing interrupt_shell does (a common pattern, sketched in code after the stack below). This leaves the thread B stack saved in this state:
thread_B_main()
foo() //something thread B was up to
interrupt_shell
switch_threads(threadB, threadD)
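A common way to arrange the "switch on the way out of the interrupt" is a flag: the ISR only records that a reschedule is wanted, and interrupt_shell's exit path performs the actual switch. A hypothetical sketch with stubbed-out pieces (all names here are illustrative, not a real kernel's):

#include <stdio.h>

static volatile int need_resched;
static int current_thread = 'B';
static int next_thread;

/* --- stubs standing in for real kernel machinery --- */
static int  time_slice_expired(void)  { return 1; }
static int  scheduler_pick_next(void) { return 'D'; }
static void switch_threads(int from, int to) {
    printf("switch %c -> %c\n", from, to);
}

void timer_isr(void) {
    if (time_slice_expired()) {
        next_thread = scheduler_pick_next();  /* decide: thread D is next */
        need_resched = 1;                     /* but don't switch in the ISR */
    }
}

/* interrupt_shell calls this as the very last thing before
   restoring the interrupted thread's CPU state */
void interrupt_exit(void) {
    if (need_resched) {
        need_resched = 0;
        switch_threads(current_thread, next_thread);
    }
}

int main(void) {                              /* simulate one timer tick */
    timer_isr();
    interrupt_exit();
    return 0;
}

Linux's need_resched flag (TIF_NEED_RESCHED), checked on the return path from interrupts, is the same idea.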
Add Deferred Service Routines
Some OSes do not allow you to do complex logic like scheduling from within ISRs. One solution is to use a deferred service routine (DSR), which runs at a higher priority than threads but lower than interrupts. These are used so that while scheduler() still needs to be protected from being preempted by DSRs, ISRs can be executed without a problem. This reduces the number of places a kernel has to mask (switch off) interrupts to keep its logic consistent.
I once ported some software from an OS that had DSRs to one that didn't. The simple solution was to create a "DSR thread" that ran at a higher priority than all other threads; it simply replaced the DSR dispatcher that the other OS had provided.
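In user-space terms, such a "DSR thread" is just a queue of deferred routines drained by one dedicated thread. A toy pthreads simulation (real ISRs can't call pthread functions, and the priority boost via pthread_setschedparam is omitted, so treat this purely as an analogy):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

typedef void (*dsr_t)(void);

static dsr_t queue[16];                /* tiny ring of deferred routines */
static int head, tail;
static sem_t pending;

static void post_dsr(dsr_t fn) {       /* what an ISR would do */
    queue[tail++ % 16] = fn;
    sem_post(&pending);
}

static void my_dsr(void) { printf("deferred work runs here\n"); }

static void *dsr_thread(void *arg) {   /* the DSR dispatcher replacement */
    (void)arg;
    for (;;) {
        sem_wait(&pending);            /* sleep until work is queued */
        queue[head++ % 16]();          /* run the deferred routine */
    }
}

int main(void) {
    pthread_t t;
    sem_init(&pending, 0, 0);
    pthread_create(&t, NULL, dsr_thread, NULL);
    post_dsr(my_dsr);                  /* pretend an ISR queued this */
    sleep(1);                          /* let the DSR thread drain it */
    return 0;                          /* process exit ends the toy thread */
}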
Add traps
You may have observed that in the examples given so far, we call the scheduler from both thread and interrupt contexts. There are two ways in and two ways out. It looks a bit weird, but it does work. However, moving forward, we may want to isolate our thread code from our kernel code, and we do this with traps. Here is the event posting redone with traps:
thread_A_main()
post_event(...)
user_space_scheduler(...)
trap()
interrupt_shell
kernel_space_scheduler(...)
switch_threads(threadA, threadB)
A trap causes an interrupt or an interrupt-like event. On the ARM CPU they are known as "software interrupts" and this is a good description.
Now all calls to switch_threads() begin and end in interrupt context, which, incidentally, usually happens in a special CPU mode. This is a step towards privilege separation.
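On Linux you can invoke this trap path directly from user space: sched_yield() is a real system call whose entire job is to trap into the kernel and let the kernel-space scheduler consider running something else.

#include <sched.h>

int main(void) {
    /* traps into the kernel; the scheduler may switch to another
       runnable thread before control returns here */
    sched_yield();
    return 0;
}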
As you can see, scheduling wasn't built in a day. You could go on from here. Happy reading!
Each core is separately running the kernel, and cooperates with other cores by reading / writing shared memory. One of the shared data structures maintained by the kernel is the list of tasks that are ready to run, and are just waiting for a timeslice to run in.
The kernel's process / thread scheduler runs on the core that needs to figure out what to do next. It's a distributed algorithm with no single decision-making thread.
Scheduling doesn't work by figuring out what task should run on which other CPU. It works by figuring out what this CPU should do now, based on which tasks are ready to run. This happens whenever a thread uses up its timeslice, or makes a system call that blocks. In Linux, even the kernel itself is pre-emptible, so a high-priority task can be run even in the middle of a system call that takes a lot of CPU time to handle (e.g. checking the permissions on all the parent directories in an open("/a/b/c/d/e/f/g/h/file", ...); if they're hot in the VFS cache, that doesn't block, it just uses a lot of CPU time).
I'm not sure if this is done by having the directory-walking loop in (a function called by) open() "manually" call schedule() to see if the current thread should be pre-empted or not. Or maybe tasks waking up will have set some kind of hardware timer to fire an interrupt, and the kernel in general is pre-emptible if compiled with CONFIG_PREEMPT.
There's an inter-processor interrupt mechanism to ask another core to schedule something on itself, so the above description is an over-simplification (e.g. Linux's run_on to support RCU sync points, and TLB shootdowns when a thread on another core uses munmap). But it's true that there isn't one "master control program"; generally the kernel on each core decides what that core should be running, by running the same schedule() function on a shared data structure of tasks that are ready to run.
The scheduler's decision-making is not always as simple as taking the task at the front of the queue: a good scheduler will try to avoid bouncing a thread from one core to another (because its data will be hot in the caches of the core it was last running on, if that was recent). So to avoid cache thrashing, a scheduler algorithm might choose not to run a ready task on the current core if it was just running on a different core, instead leaving it for that other core to get to later. That way a brief interrupt-handler or blocking system call wouldn't result in a CPU migration.
This is especially important in a NUMA system, where running on the "wrong" core will be slower long-term, even once the caches populate.
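You can see (and override) this placement logic from user space: pinning a thread to one core with the real Linux call pthread_setaffinity_np is the manual version of the keep-the-caches-warm heuristic described above.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                  /* only core 0 may run this thread */

    int err = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np: error %d\n", err);
        return 1;
    }
    printf("now running on core %d\n", sched_getcpu());
    return 0;
}

In practice it's usually best to leave placement to the scheduler unless measurements show the migrations are hurting.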