 

After a process calls syscall wait(), who will wake it up?

I have a general idea that a process can be in the ready queue, from which the CPU selects the candidate to run next. And there are these other queues on which a process waits for (broadly speaking) events. I know from OS courses a long time ago that there are wait queues for I/O and interrupts. My questions are:

  1. There are many events a process can wait on. Is there a wait queue corresponding to each such event?

  2. Are these wait queues created/destroyed dynamically? If so, which kernel module is responsible for managing these queues? The scheduler? Are there any predefined queues that will always exist?

  3. To eventually get a waiting process off a wait queue, does the kernel have a way of mapping from each actual event (either hardware or software) to the wait queue, and then remove ALL processes on that queue? If so, what mechanisms does a kernel employ?

To give an example:

....
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {      // child process
        sleep(1);        // do something for a second
        exit(0);
    } else {             // parent process
        wait(NULL);
        printf("Child completed.\n");
    }
    return 0;
}
....

wait(NULL) is a blocking system call. I want to know the rest of the journey the parent process goes through. My take on the story line is as follows; PLEASE correct me if I miss crucial steps or if I am completely wrong:

  1. Normal system call setup through the libc runtime. The parent process is now in kernel mode, ready to execute whatever is in the wait() syscall.

  2. wait(NULL) creates a wait queue in a place where the kernel can later find it.

  3. wait(NULL) puts the parent process onto this queue and creates an entry in some map that says "If I (the kernel) ever receive a software interrupt, signal, or whatever else indicates that the child process is finished, the scheduler should come look at this wait queue".

  4. The child process finishes and the kernel somehow notices this fact. The kernel context-switches to the scheduler, which looks up the map to find the wait queue the parent process is on.

  5. The scheduler moves the parent process to the ready queue, does its magic, and some time later the parent process is finally selected to run.

  6. The parent process is still in kernel mode, inside the wait(NULL) syscall. The main job of the rest of the syscall is now to exit kernel mode and eventually return the parent process to user land.

  7. The process continues its journey at the next instruction, and may later wait on other wait queues until it finishes.

PS: I am hoping to learn the inner workings of the OS kernel: what stages a process goes through in the kernel and how the kernel interacts with and manipulates these processes. I do know the semantics and the contract of the wait() syscall API, and that is not what I am asking about in this question.

Rich asked Oct 14 '16 04:10



1 Answer

Let's explore the kernel sources. First of all, it seems all the various wait routines (wait, waitid, waitpid, wait3, wait4) end up in the same system call, wait4. These days you can find system calls in the kernel by looking for the macros SYSCALL_DEFINE1, SYSCALL_DEFINE2 and so on, where the number is the number of parameters, which for wait4 is coincidentally 4. Using the Google-based freetext search in the Free Electrons Linux Cross Reference we eventually find the definition:

1674 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
1675                 int, options, struct rusage __user *, ru)

Here the macro seems to split each parameter into its type and name. This wait4 routine does some parameter checking, copies them into a wait_opts structure, and calls do_wait(), which is a few lines up in the same file:

1677         struct wait_opts wo;
1705         ret = do_wait(&wo);

1551 static long do_wait(struct wait_opts *wo)
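
For reference, the wait_opts structure that wait4() fills in and hands to do_wait() looks roughly like this in kernels of that vintage (from the top of kernel/exit.c; the field layout may differ slightly between versions):

struct wait_opts {
        enum pid_type           wo_type;
        int                     wo_flags;
        struct pid              *wo_pid;

        struct siginfo __user   *wo_info;
        int __user              *wo_stat;
        struct rusage __user    *wo_rusage;

        wait_queue_t            child_wait;     /* the entry added to the wait queue */
        int                     notask_error;
};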

(I'm missing out lines in these excerpts as you can tell by the non-consecutive line numbers). do_wait() sets another field of the structure to the name of a function, child_wait_callback() which is a few lines up in the same file. Another field is set to current. This is a major "global" that points to information held about the current task:

1558         init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
1559         wo->child_wait.private = current;
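
That callback is worth a quick look now, because it is what will eventually run at wake-up time. In kernels of that era it reads roughly as follows (paraphrased, so check the exact source): it uses container_of() to get back from the wait queue entry to the enclosing wait_opts, checks that the child that changed state is one this waiter actually cares about, and only then performs the default wake-up:

static int child_wait_callback(wait_queue_t *wait, unsigned mode,
                                int sync, void *key)
{
        struct wait_opts *wo = container_of(wait, struct wait_opts,
                                                child_wait);
        struct task_struct *p = key;    /* the child that changed state */

        if (!eligible_pid(wo, p))
                return 0;               /* not a child this waiter is interested in */

        if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent)
                return 0;

        return default_wake_function(wait, mode, sync, key);
}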

The structure is then added to a queue specifically designed for a process to wait for SIGCHLD signals, current->signal->wait_chldexit:

1560         add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);

Let's look at current. It is quite hard to find its definition, as it varies per architecture, and following it to the final structure is a bit of a rabbit warren. E.g. current.h:

  6 #define get_current() (current_thread_info()->task)
  7 #define current get_current()

then thread_info.h

163 static inline struct thread_info *current_thread_info(void)
165         return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);

 55 struct thread_info {
 56         struct task_struct      *task;          /* main task structure */

So current points to a task_struct, which we find in sched.h

1460 struct task_struct {
1461         volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
1659 /* signal handlers */
1660         struct signal_struct *signal;

So we have found the current->signal part of current->signal->wait_chldexit, and struct signal_struct is in the same file:

670 struct signal_struct {
677         wait_queue_head_t       wait_chldexit;  /* for wait4() */

So the add_wait_queue() call we had got to above refers to this wait_chldexit structure of type wait_queue_head_t.

A wait queue is simply an initially empty, doubly-linked list of structures that contain a struct list_head (types.h):

184 struct list_head {
185         struct list_head *next, *prev;
186 };
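
For completeness, this is roughly how wait.h defined the queue head and the per-waiter entry at the time (wait_queue_func_t is a function-pointer typedef; the names changed to wait_queue_head and wait_queue_entry in 4.13):

struct __wait_queue_head {
        spinlock_t              lock;           /* protects task_list */
        struct list_head        task_list;      /* the list of waiters */
};
typedef struct __wait_queue_head wait_queue_head_t;

struct __wait_queue {
        unsigned int            flags;          /* e.g. WQ_FLAG_EXCLUSIVE */
        void                    *private;       /* usually the sleeping task */
        wait_queue_func_t       func;           /* called at wake-up time */
        struct list_head        task_list;      /* links into the head's list */
};
typedef struct __wait_queue wait_queue_t;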

The call add_wait_queue() (wait.c) temporarily locks the structure and, via an inline function (wait.h), calls list_add(), which you can find in list.h. This sets the next and prev pointers appropriately to add the new item to the list. An empty list has both pointers pointing at the list_head structure itself.
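
list_add() itself is just the textbook linked-list insertion, a couple of inline functions in list.h (stripped here of the debugging checks):

static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}

/* insert 'new' right after 'head' */
static inline void list_add(struct list_head *new, struct list_head *head)
{
        __list_add(new, head, head->next);
}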

After adding the new entry to the list, the wait4() system call sets a flag that will remove the process from the runnable queue on the next reschedule and calls do_wait_thread():

1573         set_current_state(TASK_INTERRUPTIBLE);
1577                 retval = do_wait_thread(wo, tsk);

This routine calls wait_consider_task() for each child of the process:

1501 static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk)
1505         list_for_each_entry(p, &tsk->children, sibling) {
1506                 int ret = wait_consider_task(wo, 0, p);

which goes very deep, but in fact is just trying to see whether any child already satisfies the syscall, in which case we can return with the data immediately. The interesting case for you is when nothing is found but there are still running children. We end up calling schedule(), which is when the process gives up the CPU and our system call "hangs" waiting for a future event.

1594                 if (!signal_pending(current)) {
1595                         schedule();
1596                         goto repeat;
1597                 }

When the process is woken up, it will continue with the code after schedule() and again go through all the children to see if the wait condition is satisfied, and probably return to the caller.
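
Distilled out of do_wait(), this is the classic manual sleep pattern that the Linux Device Drivers book documents. The names q and condition below are placeholders for this sketch, not kernel symbols:

DECLARE_WAITQUEUE(wait, current);               /* entry whose .private is this task */

add_wait_queue(&q, &wait);
for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);  /* leave the run queue at the next schedule() */
        if (condition)                          /* e.g. a child has already exited */
                break;
        if (signal_pending(current))            /* a signal interrupts the wait */
                break;
        schedule();                             /* give up the CPU until woken */
}
__set_current_state(TASK_RUNNING);
remove_wait_queue(&q, &wait);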

What wakes the process up to do that? A child dies and generates a SIGCHLD signal. In signal.c, do_notify_parent() is called by a process as it dies:

1566  * Let a parent know about the death of a child.
1572 bool do_notify_parent(struct task_struct *tsk, int sig)
1656         __wake_up_parent(tsk, tsk->parent);

__wake_up_parent() (exit.c) calls __wake_up_sync_key() and uses exactly the wait_chldexit wait queue we set up previously:

1545 void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
1547         __wake_up_sync_key(&parent->signal->wait_chldexit,
1548                                 TASK_INTERRUPTIBLE, 1, p);
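
Under the hood, the __wake_up*() family all funnel into __wake_up_common(), which simply walks the queue's list and calls each entry's func callback; for the entry do_wait() queued, that callback is the child_wait_callback() we saw earlier. Roughly, from kernel/sched/wait.c of that era:

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, int wake_flags, void *key)
{
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                if (curr->func(curr, mode, wake_flags, key) &&
                                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
}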

I think we should stop there, as wait() is clearly one of the more complex examples of a system call and of the use of wait queues. You can find a simpler presentation of the mechanism in this 3-page Linux Journal article from 2005. Many things have changed, but the principle is explained. You might also buy the books "Linux Device Drivers" and "Linux Kernel Development", or check out the earlier editions of them that can be found online.

For the "Anatomy Of A System Call" on the way from user space to the kernel you might read these lwn articles.


Wait queues are heavily used throughout the kernel whenever a task needs to wait for some condition. A grep through the kernel sources finds over 1200 calls of init_waitqueue_head(), which is how you initialise a wait queue you have created dynamically, simply by kmalloc()-ing the space to hold the structure.

A grep for the DECLARE_WAIT_QUEUE_HEAD() macro finds over 150 uses of this declaration of a static wait queue structure. There is no intrinsic difference between the two. A driver, for example, can choose either method to create a wait queue, often depending on whether it has to manage many similar devices, each with its own queue, or expects only one device.
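
In code, the two options look like this; mydrv_dev and its fields are made-up names for illustration:

/* Static: the queue head lives for the lifetime of the module. */
static DECLARE_WAIT_QUEUE_HEAD(mydrv_wq);

/* Dynamic: one queue per device instance, set up when the device is probed. */
struct mydrv_dev {
        wait_queue_head_t       wq;
        /* ... other per-device state ... */
};

struct mydrv_dev *dev = kmalloc(sizeof(*dev), GFP_KERNEL);
if (dev)
        init_waitqueue_head(&dev->wq);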

No central code is responsible for these queues, though there is common code to simplify their use. A driver, for example, might create an empty wait queue when it is installed and initialised. When you use it to read data from some hardware, it might start the read operation by writing directly into the registers of the hardware, then queue an entry (for "this" task, i.e. current) on its wait queue to give up the cpu until the hardware has the data ready.

The hardware would then interrupt the cpu, and the kernel would call the driver's interrupt handler (registered at initialisation). The handler code would simply call wake_up() on the wait queue, for the kernel to put all tasks on the wait queue back in the run queue.

When the task gets the cpu again, it continues where it left off (in schedule()) and checks that the hardware has completed the operation, and can then return the data to the user.
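
A hedged sketch of that read-plus-interrupt pattern, using the higher-level wait_event_interruptible() helper rather than the manual loop shown above; all mydev_* names are invented for the example:

static DECLARE_WAIT_QUEUE_HEAD(mydev_wq);
static int mydev_data_ready;

static irqreturn_t mydev_irq_handler(int irq, void *dev_id)
{
        mydev_data_ready = 1;                   /* hardware has finished */
        wake_up_interruptible(&mydev_wq);       /* put the sleepers back on the run queue */
        return IRQ_HANDLED;
}

static ssize_t mydev_read(struct file *f, char __user *buf,
                          size_t len, loff_t *off)
{
        /* kick off the hardware operation here, then give up the CPU
         * until the interrupt handler sets mydev_data_ready */
        if (wait_event_interruptible(mydev_wq, mydev_data_ready))
                return -ERESTARTSYS;            /* woken by a signal instead */
        mydev_data_ready = 0;
        /* ... copy_to_user() the freshly read data ... */
        return len;
}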

So the kernel is not responsible for the driver's wait queue, as it only looks at it when the driver calls it to do so. There is no mapping from the hardware interrupt to the wait queue, for example.

If there are several tasks on the same wait queue, there are variants of the wake_up() call that can be used to wake up only one task, or all of them, or only those in an interruptible wait (i.e. those designed to be able to cancel the operation and return to the user in case of a signal), and so on.
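
For example, a few of the variants declared in <linux/wait.h> (q being whatever wait queue head the code owns):

wake_up(&q);                      /* wake one exclusive waiter plus all non-exclusive ones */
wake_up_all(&q);                  /* wake every waiter on the queue */
wake_up_interruptible(&q);        /* wake only tasks sleeping in TASK_INTERRUPTIBLE */
wake_up_interruptible_all(&q);
wake_up_interruptible_nr(&q, n);  /* wake at most n exclusive interruptible waiters */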

meuh answered Sep 22 '22 08:09