I have a general idea that a process can be in a ready queue, from which the CPU scheduler selects the candidate to run next, and that there are other queues on which a process waits for (broadly speaking) events. I know from OS courses a long time ago that there are wait queues for I/O and interrupts. My questions are:
There are many events a process can wait on. Is there a wait queue corresponding to each such event?
Are these wait queues created/destroyed dynamically? If so, which kernel module is responsible for managing these queues? The scheduler? Are there any predefined queues that will always exist?
To eventually get a waiting process off a wait queue, does the kernel have a way of mapping each actual event (hardware or software) to its wait queue, and then removing ALL processes on that queue? If so, what mechanisms does the kernel employ?
To give an example:
....
pid = fork();
if (pid == 0) {           // child process
    // do something for a second
}
else {                    // parent process
    wait(NULL);
    printf("Child completed.");
}
....
wait(NULL) is a blocking system call. I want to know the rest of the journey the parent process goes through. My take on the story is as follows; PLEASE correct me if I miss crucial steps or if I am completely wrong:
Normal system call setup through the libc runtime. Now the parent process is in kernel mode, ready to execute whatever is in the wait() syscall.
wait(NULL) creates a wait queue where the kernel can later find it.
wait(NULL) puts the parent process onto this queue, and creates an entry in some map that says "If I (the kernel) ever receive a software interrupt, signal, or whatever that indicates that the child process is finished, the scheduler should come look at this wait queue".
The child process finishes and the kernel somehow notices this fact. The kernel context switches to the scheduler, which looks up in the map to find the wait queue the parent process is on.
The scheduler moves the parent process to the ready queue, does its magic, and some time later the parent process is finally selected to run.
The parent process is still in kernel mode, inside the wait(NULL) syscall. Now the main job of the rest of the syscall is to exit kernel mode and eventually return the parent process to user land.
The process continues its journey on the next instruction, and may later be waiting on other wait queues until it finishes.
PS: I am hoping to learn the inner workings of the OS kernel: what stages a process goes through in the kernel and how the kernel interacts with and manipulates these processes. I do know the semantics and the contract of the wait() syscall API, and that is not what I want to know from this question.
From the GNU C Library manual ("BSD Process Wait Functions"): the library defines macros such as WEXITSTATUS so that they will work on either kind of object, and the wait function is defined to accept either type of pointer as its status-ptr argument. These functions are declared in sys/wait.h.
The wait() system call suspends execution of the current process until one of its children terminates. The call wait(&status) is equivalent to: waitpid(-1, &status, 0); The waitpid() system call suspends execution of the current process until a child specified by pid argument has changed state.
wait(NULL) will block the parent process until any of its children has finished. If the child terminates before the parent process reaches wait(NULL), the child process turns into a zombie process until its parent waits on it and it is released from memory.
The wait() function will suspend execution of the calling thread until status information for one of its terminated child processes is available, or until delivery of a signal whose action is either to execute a signal-catching function or to terminate the process.
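For illustration, here is a minimal, self-contained user-space sketch of the pattern being discussed (the exit code and the one-second sleep are arbitrary choices):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {            /* child: pretend to work for a second */
        sleep(1);
        _exit(42);
    }

    int status;                /* parent: block until the child changes state */
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        printf("Child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}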
Let's explore the kernel sources. First of all, it seems all the various wait routines (wait, waitpid, wait3, wait4) end up in the same system call, wait4 (waitid gets its own, very similar, syscall). These days you can find system calls in the kernel by looking for the SYSCALL_DEFINE1, SYSCALL_DEFINE2, ... macros, where the number is the number of parameters, which for wait4 is coincidentally 4. Using the google-based freetext search in the Free Electrons Linux Cross Reference we eventually find the definition:
1674 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
1675 int, options, struct rusage __user *, ru)
Here the macro seems to split each parameter into its type and name. This wait4 routine does some parameter checking, copies the parameters into a wait_opts structure, and calls do_wait(), which is a few lines up in the same file:
1677 struct wait_opts wo;
1705 ret = do_wait(&wo);
1551 static long do_wait(struct wait_opts *wo)
(I'm leaving out lines in these excerpts, as you can tell by the non-consecutive line numbers.)
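As an aside, the SYSCALL_DEFINEn convention applies to any system call, not just wait4. A hypothetical two-parameter syscall (all names invented, purely to show the shape of the macro) would be declared like this:

SYSCALL_DEFINE2(mydemo, int, flags, int __user *, result)
{
        int value = 0;

        if (flags != 0)
                return -EINVAL;
        if (copy_to_user(result, &value, sizeof(value)))
                return -EFAULT;
        return 0;
}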
do_wait() sets another field of the structure to the name of a function, child_wait_callback(), which is a few lines up in the same file. Another field is set to current. This is a major "global" that points to the information held about the current task:
1558 init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
1559 wo->child_wait.private = current;
The structure is then added to a queue specifically designed for a process to wait for SIGCHLD signals, current->signal->wait_chldexit:
1560         add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
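For context, the callback installed above decides whether a particular exiting child should wake this waiter at all. Condensed from the same file (one additional flag check omitted), it looks roughly like this:

static int child_wait_callback(wait_queue_t *wait, unsigned mode,
                               int sync, void *key)
{
        struct wait_opts *wo = container_of(wait, struct wait_opts,
                                            child_wait);
        struct task_struct *p = key;    /* the child that is exiting */

        if (!eligible_pid(wo, p))       /* not a child we are waiting for */
                return 0;

        return default_wake_function(wait, mode, sync, key);
}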
Let's look at current. It is quite hard to find its definition, as it varies per architecture, and following it to the final structure is a bit of a rabbit warren. E.g. current.h:
6 #define get_current() (current_thread_info()->task)
7 #define current get_current()
then thread_info.h
163 static inline struct thread_info *current_thread_info(void)
165 return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
55 struct thread_info {
56 struct task_struct *task; /* main task structure */
So current points to a task_struct, which we find in sched.h:
1460 struct task_struct {
1461 volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
1659 /* signal handlers */
1660 struct signal_struct *signal;
So we have found the current->signal part of current->signal->wait_chldexit, and struct signal_struct is in the same file:
670 struct signal_struct {
677 wait_queue_head_t wait_chldexit; /* for wait4() */
So the add_wait_queue() call we got to above refers to this wait_chldexit structure of type wait_queue_head_t.
A wait queue is simply an initially empty, doubly-linked list of structures, each of which contains a struct list_head (types.h):
184 struct list_head {
185 struct list_head *next, *prev;
186 };
The call add_wait_queue() (wait.c) temporarily locks the structure and, via an inline function (wait.h), calls list_add(), which you can find in list.h. This sets the next and prev pointers appropriately to add the new item to the list.
An empty list has the two pointers pointing at the list_head structure.
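A stripped-down version of the insertion code, roughly mirroring list.h with its debugging checks omitted, makes the pointer manipulation explicit:

/* Insert 'new' between two known consecutive entries. */
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next  = next;
        new->prev  = prev;
        prev->next = new;
}

/* Insert 'new' right after 'head'. An empty head points at itself,
 * so the first insertion produces a two-element ring. */
static inline void list_add(struct list_head *new, struct list_head *head)
{
        __list_add(new, head, head->next);
}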
After adding the new entry to the list, the wait4() system call sets a flag that will remove the process from the runnable queue on the next reschedule, and calls do_wait_thread():
1573 set_current_state(TASK_INTERRUPTIBLE);
1577 retval = do_wait_thread(wo, tsk);
This routine calls wait_consider_task() for each child of the process:
1501 static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk)
1505 list_for_each_entry(p, &tsk->children, sibling) {
1506 int ret = wait_consider_task(wo, 0, p);
which goes very deep but in fact is just trying to see if any child already satisfies the syscall, in which case we can return with the data immediately. The interesting case for you is when nothing is found, but there are still running children. We end up calling schedule(), which is when the process gives up the CPU and our system call "hangs" until a future event.
1594 if (!signal_pending(current)) {
1595 schedule();
1596 goto repeat;
1597 }
When the process is woken up, it will continue with the code after schedule(), again go through all the children to see if the wait condition is satisfied, and probably return to the caller.
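This sleep-until-condition pattern is so common that wait.h provides helpers for it. A generic sketch of the idiom using the long-standing prepare_to_wait()/finish_wait() API (my_queue and condition_is_true() are invented placeholders) looks like:

DEFINE_WAIT(wait);

for (;;) {
        prepare_to_wait(&my_queue, &wait, TASK_INTERRUPTIBLE);
        if (condition_is_true())        /* e.g. a child has exited */
                break;
        if (signal_pending(current))    /* let the caller handle the signal */
                break;
        schedule();                     /* give up the CPU until woken */
}
finish_wait(&my_queue, &wait);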
What wakes up the process to do that? A child dies and generates a SIGCHLD signal. In signal.c, do_notify_parent() is called by a process as it dies:
1566 * Let a parent know about the death of a child.
1572 bool do_notify_parent(struct task_struct *tsk, int sig)
1656 __wake_up_parent(tsk, tsk->parent);
__wake_up_parent() calls __wake_up_sync_key() and uses exactly the wait_chldexit wait queue we set up previously (exit.c):
1545 void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
1547 __wake_up_sync_key(&parent->signal->wait_chldexit,
1548 TASK_INTERRUPTIBLE, 1, p);
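For completeness, the generic wake-up path of that kernel era, __wake_up_common(), simply walks the entries on the wait queue head and invokes each entry's func() callback, which in our case is child_wait_callback(). Condensed, it looks roughly like this:

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                /* Each entry's callback decides whether to wake its task;
                 * exclusive waiters stop the walk after nr_exclusive wake-ups. */
                if (curr->func(curr, mode, wake_flags, key) &&
                    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
}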
I think we should stop there, as wait() is clearly one of the more complex examples of a system call and of the use of wait queues. You can find a simpler presentation of the mechanism in this 3-page Linux Journal article from 2005. Many things have changed since, but the principle is explained. You might also buy the books "Linux Device Drivers" and "Linux Kernel Development", or check out the earlier editions of these that can be found online.
For the "Anatomy of a System Call" path from user space into the kernel, you might read these LWN articles.
Wait queues are heavily used throughout the kernel whenever a task needs to wait for some condition. A grep through the kernel sources finds over 1200 calls of init_waitqueue_head(), which is how you initialise a wait queue that you have dynamically created, simply by kmalloc()-ing the space to hold the structure.
A grep for the DECLARE_WAIT_QUEUE_HEAD() macro finds over 150 uses of this declaration of a static wait queue structure. There is no intrinsic difference between the two. A driver, for example, can choose either method to create a wait queue, often depending on whether it has to manage many similar devices, each with its own queue, or expects only a single device.
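The two styles look roughly like this (the driver structure and names are invented for illustration):

/* Static wait queue, declared and initialised at compile time. */
static DECLARE_WAIT_QUEUE_HEAD(mydrv_global_wq);

/* Dynamic wait queue embedded in a per-device structure,
 * initialised when the device is set up. */
struct mydrv_device {
        wait_queue_head_t read_wq;
        /* ... other per-device state ... */
};

static void mydrv_setup(struct mydrv_device *dev)
{
        init_waitqueue_head(&dev->read_wq);
}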
No central code is responsible for these queues, though there is common code to simplify their use. A driver, for example, might create an empty wait queue when it is installed and initialised. When you use it to read data from some hardware, it might start the read operation by writing directly into the hardware's registers, then queue an entry (for "this" task, i.e. current) on its wait queue to give up the CPU until the hardware has the data ready.
The hardware would then interrupt the CPU, and the kernel would call the driver's interrupt handler (registered at initialisation). The handler would simply call wake_up() on the wait queue, so that the kernel puts all tasks waiting on that queue back on the run queue.
When the task gets the CPU again, it continues where it left off (in schedule()), checks that the hardware has completed the operation, and can then return the data to the user.
So the kernel is not responsible for the driver's wait queue, as it only looks at it when the driver calls it to do so. There is no mapping from the hardware interrupt to the wait queue, for example.
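Put together, such a driver's read path and interrupt handler might look roughly like the sketch below. The device structure (extending the invented mydrv_device from the earlier sketch), the register helpers, and all names are made up; wait_event_interruptible() is a convenience wrapper around the prepare_to_wait()/schedule() loop shown earlier:

/* Hypothetical driver sketch: sleep in read() until the interrupt
 * handler reports that data is ready, then copy it to the user. */
struct mydrv_device {
        wait_queue_head_t read_wq;
        bool data_ready;
        u8 data;
};

static ssize_t mydrv_read(struct file *file, char __user *buf,
                          size_t count, loff_t *ppos)
{
        struct mydrv_device *dev = file->private_data;

        mydrv_start_hw_read(dev);       /* poke the hardware registers (made up) */

        /* Sleep on the driver's own wait queue until the interrupt
         * handler sets data_ready, or a signal arrives. */
        if (wait_event_interruptible(dev->read_wq, dev->data_ready))
                return -ERESTARTSYS;

        dev->data_ready = false;
        if (copy_to_user(buf, &dev->data, sizeof(dev->data)))
                return -EFAULT;
        return sizeof(dev->data);
}

static irqreturn_t mydrv_irq(int irq, void *cookie)
{
        struct mydrv_device *dev = cookie;

        dev->data = mydrv_read_hw_data(dev);    /* fetch the result (made up) */
        dev->data_ready = true;
        wake_up_interruptible(&dev->read_wq);   /* put sleepers back on the run queue */
        return IRQ_HANDLED;
}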
If there are several tasks on the same wait queue, there are variants of the wake_up() call that can be used to wake up only one task, or all of them, or only those that are in an interruptible wait (i.e. are designed to be able to cancel the operation and return to the user in case of a signal), and so on.
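A few of the common variants, as declared in include/linux/wait.h (a selection, not the complete list):

wake_up(&q);                    /* wake all non-exclusive waiters plus one exclusive one */
wake_up_nr(&q, nr);             /* as above, but up to nr exclusive waiters              */
wake_up_all(&q);                /* wake every waiter on the queue                        */
wake_up_interruptible(&q);      /* only wake tasks sleeping in TASK_INTERRUPTIBLE        */
wake_up_interruptible_all(&q);  /* interruptible variant of wake_up_all()                */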