I have a general idea that a process can be in a ready queue, from which the CPU scheduler selects the candidate to run next, and that there are other queues on which a process waits for (broadly speaking) events. I know from OS courses a long time ago that there are wait queues for I/O and interrupts. My questions are:
There are many events a process can wait on. Is there a wait queue corresponding to each such event?
Are these wait queues created/destroyed dynamically? If so, which kernel module is responsible for managing these queues? The scheduler? Are there any predefined queues that will always exist?
To eventually get a waiting process off a wait queue, does the kernel have a way of mapping each actual event (hardware or software) to its wait queue, and then removing ALL processes on that queue? If so, what mechanisms does the kernel employ?
To give an example:
....
pid = fork();
if (pid == 0) {           // child process
    // do something for a second
}
else {                    // parent process
    wait(NULL);
    printf("Child completed.");
}
....
wait(NULL) is a blocking system call. I want to know the rest of the journey the parent process goes through. My take on the story is as follows; PLEASE correct me if I miss crucial steps or if I am completely wrong:
Normal system call setup through the libc runtime. Now the parent process is in kernel mode, ready to execute whatever is in the wait() syscall.
wait(NULL) creates a wait queue where the kernel can later find it.
wait(NULL) puts the parent process onto this queue, and creates an entry in some map that says "If I (the kernel) ever receive a software interrupt, signal, or whatever that indicates that the child process is finished, the scheduler should come look at this wait queue".
The child process finishes and the kernel somehow notices this fact. The kernel context switches to the scheduler, which looks up in the map to find the wait queue the parent process is on.
The scheduler moves the parent process to the ready queue, does its magic, and some time later the parent process is finally selected to run.
The parent process is still in kernel mode, inside the wait(NULL) syscall. Now the main job of the rest of the syscall is to exit kernel mode and eventually return the parent process to user land.
The process continues its journey on the next instruction, and may later be waiting on other wait queues until it finishes.
PS: I am hoping to learn the inner workings of the OS kernel: what stages a process goes through in the kernel and how the kernel interacts with and manipulates these processes. I do know the semantics and the contract of the wait() syscall API, and that is not what I want to know from this question.
From the GNU C Library manual ("BSD Process Wait Functions"): the library defines macros such as WEXITSTATUS so that they will work on either kind of object, and the wait function is defined to accept either type of pointer as its status-ptr argument. These functions are declared in sys/wait.h.
The wait() system call suspends execution of the current process until one of its children terminates. The call wait(&status) is equivalent to: waitpid(-1, &status, 0); The waitpid() system call suspends execution of the current process until a child specified by pid argument has changed state.
wait(NULL) will block the parent process until any of its children has finished. If the child terminates before the parent process reaches wait(NULL), the child process turns into a zombie process until its parent waits on it and it is released from memory.
The wait() function will suspend execution of the calling thread until status information for one of its terminated child processes is available, or until delivery of a signal whose action is either to execute a signal-catching function or to terminate the process.
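For illustration, here is a minimal, self-contained user-space sketch of the pattern being discussed (the exit code and the one-second sleep are arbitrary choices):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {            /* child: pretend to work for a second */
        sleep(1);
        _exit(42);
    }

    int status;                /* parent: block until the child changes state */
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        printf("Child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}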
Let's explore the kernel sources. First of all, it seems all the various wait routines (wait, waitpid, wait3, wait4) end up in the same system call, wait4 (waitid gets its own, very similar, syscall). These days you can find system calls in the kernel by looking for the SYSCALL_DEFINE1, SYSCALL_DEFINE2, ... macros, where the number is the number of parameters, which for wait4 is coincidentally 4. Using the google-based freetext search in the Free Electrons Linux Cross Reference we eventually find the definition:
1674 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
1675 int, options, struct rusage __user *, ru)
Here the macro seems to split each parameter into its type and name. This wait4 routine does some parameter checking, copies the parameters into a wait_opts structure, and calls do_wait(), which is a few lines up in the same file:
1677 struct wait_opts wo;
1705 ret = do_wait(&wo);
1551 static long do_wait(struct wait_opts *wo)
(I'm leaving out lines in these excerpts, as you can tell by the non-consecutive line numbers.)
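As an aside, the SYSCALL_DEFINEn convention applies to any system call, not just wait4. A hypothetical two-parameter syscall (all names invented, purely to show the shape of the macro) would be declared like this:

SYSCALL_DEFINE2(mydemo, int, flags, int __user *, result)
{
        int value = 0;

        if (flags != 0)
                return -EINVAL;
        if (copy_to_user(result, &value, sizeof(value)))
                return -EFAULT;
        return 0;
}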
do_wait() sets another field of the structure to the name of a function, child_wait_callback(), which is a few lines up in the same file. Another field is set to current. This is a major "global" that points to the information held about the current task:
1558 init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
1559 wo->child_wait.private = current;
The structure is then added to a queue specifically designed for a process to wait for SIGCHLD signals, current->signal->wait_chldexit:
1560         add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
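For context, the callback installed above decides whether a particular exiting child should wake this waiter at all. Condensed from the same file (one additional flag check omitted), it looks roughly like this:

static int child_wait_callback(wait_queue_t *wait, unsigned mode,
                               int sync, void *key)
{
        struct wait_opts *wo = container_of(wait, struct wait_opts,
                                            child_wait);
        struct task_struct *p = key;    /* the child that is exiting */

        if (!eligible_pid(wo, p))       /* not a child we are waiting for */
                return 0;

        return default_wake_function(wait, mode, sync, key);
}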
Let's look at current. It is quite hard to find its definition, as it varies per architecture, and following it to the final structure is a bit of a rabbit warren. E.g. current.h:
6 #define get_current() (current_thread_info()->task)
7 #define current get_current()
then thread_info.h
163 static inline struct thread_info *current_thread_info(void)
165 return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
55 struct thread_info {
56 struct task_struct *task; /* main task structure */
So current points to a task_struct, which we find in sched.h:
1460 struct task_struct {
1461 volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
1659 /* signal handlers */
1660 struct signal_struct *signal;
So we have found the current->signal part of current->signal->wait_chldexit, and struct signal_struct is in the same file:
670 struct signal_struct {
677 wait_queue_head_t wait_chldexit; /* for wait4() */
So the add_wait_queue() call we got to above refers to this wait_chldexit structure of type wait_queue_head_t.
A wait queue is simply an initially empty, doubly-linked list of structures, each of which contains a struct list_head (types.h):
184 struct list_head {
185 struct list_head *next, *prev;
186 };
The call add_wait_queue() (wait.c) temporarily locks the structure and, via an inline function (wait.h), calls list_add(), which you can find in list.h. This sets the next and prev pointers appropriately to add the new item to the list.
An empty list has the two pointers pointing at the list_head structure.
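A stripped-down version of the insertion code, roughly mirroring list.h with its debugging checks omitted, makes the pointer manipulation explicit:

/* Insert 'new' between two known consecutive entries. */
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next  = next;
        new->prev  = prev;
        prev->next = new;
}

/* Insert 'new' right after 'head'. An empty head points at itself,
 * so the first insertion produces a two-element ring. */
static inline void list_add(struct list_head *new, struct list_head *head)
{
        __list_add(new, head, head->next);
}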
After adding the new entry to the list, the wait4() system call sets a flag that will remove the process from the runnable queue on the next reschedule, and calls do_wait_thread():
1573 set_current_state(TASK_INTERRUPTIBLE);
1577 retval = do_wait_thread(wo, tsk);
This routine calls wait_consider_task() for each child of the process:
1501 static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk)
1505 list_for_each_entry(p, &tsk->children, sibling) {
1506 int ret = wait_consider_task(wo, 0, p);
which goes very deep but in fact is just trying to see if any child already satisfies the syscall, in which case we can return with the data immediately. The interesting case for you is when nothing is found, but there are still running children. We end up calling schedule(), which is when the process gives up the CPU and our system call "hangs" until a future event.
1594 if (!signal_pending(current)) {
1595 schedule();
1596 goto repeat;
1597 }
When the process is woken up, it will continue with the code after schedule(), again go through all the children to see if the wait condition is satisfied, and probably return to the caller.
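This sleep-until-condition pattern is so common that wait.h provides helpers for it. A generic sketch of the idiom using the long-standing prepare_to_wait()/finish_wait() API (my_queue and condition_is_true() are invented placeholders) looks like:

DEFINE_WAIT(wait);

for (;;) {
        prepare_to_wait(&my_queue, &wait, TASK_INTERRUPTIBLE);
        if (condition_is_true())        /* e.g. a child has exited */
                break;
        if (signal_pending(current))    /* let the caller handle the signal */
                break;
        schedule();                     /* give up the CPU until woken */
}
finish_wait(&my_queue, &wait);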
What wakes up the process to do that? A child dies and generates a SIGCHLD signal. In signal.c, do_notify_parent() is called by a process as it dies:
1566 * Let a parent know about the death of a child.
1572 bool do_notify_parent(struct task_struct *tsk, int sig)
1656 __wake_up_parent(tsk, tsk->parent);
__wake_up_parent() calls __wake_up_sync_key() and uses exactly the wait_chldexit wait queue we set up previously (exit.c):
1545 void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
1547 __wake_up_sync_key(&parent->signal->wait_chldexit,
1548 TASK_INTERRUPTIBLE, 1, p);
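For completeness, the generic wake-up path of that kernel era, __wake_up_common(), simply walks the entries on the wait queue head and invokes each entry's func() callback, which in our case is child_wait_callback(). Condensed, it looks roughly like this:

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                /* Each entry's callback decides whether to wake its task;
                 * exclusive waiters stop the walk after nr_exclusive wake-ups. */
                if (curr->func(curr, mode, wake_flags, key) &&
                    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
}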
I think we should stop there, as wait() is clearly one of the more complex examples of a system call and of the use of wait queues. You can find a simpler presentation of the mechanism in this 3-page Linux Journal article from 2005. Many things have changed since, but the principle is explained. You might also buy the books "Linux Device Drivers" and "Linux Kernel Development", or check out the earlier editions of these that can be found online.
For the "Anatomy of a System Call" path from user space into the kernel, you might read these LWN articles.
Wait queues are heavily used throughout the kernel whenever a task needs to wait for some condition. A grep through the kernel sources finds over 1200 calls of init_waitqueue_head(), which is how you initialise a wait queue that you have dynamically created, simply by kmalloc()-ing the space to hold the structure.
A grep for the DECLARE_WAIT_QUEUE_HEAD() macro finds over 150 uses of this declaration of a static wait queue structure. There is no intrinsic difference between the two. A driver, for example, can choose either method to create a wait queue, often depending on whether it has to manage many similar devices, each with its own queue, or expects only a single device.
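The two styles look roughly like this (the driver structure and names are invented for illustration):

/* Static wait queue, declared and initialised at compile time. */
static DECLARE_WAIT_QUEUE_HEAD(mydrv_global_wq);

/* Dynamic wait queue embedded in a per-device structure,
 * initialised when the device is set up. */
struct mydrv_device {
        wait_queue_head_t read_wq;
        /* ... other per-device state ... */
};

static void mydrv_setup(struct mydrv_device *dev)
{
        init_waitqueue_head(&dev->read_wq);
}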
No central code is responsible for these queues, though there is common code to simplify their use. A driver, for example, might create an empty wait queue when it is installed and initialised. When you use it to read data from some hardware, it might start the read operation by writing directly into the hardware's registers, then queue an entry (for "this" task, i.e. current) on its wait queue to give up the CPU until the hardware has the data ready.
The hardware would then interrupt the CPU, and the kernel would call the driver's interrupt handler (registered at initialisation). The handler would simply call wake_up() on the wait queue, so that the kernel puts all tasks waiting on that queue back on the run queue.
When the task gets the CPU again, it continues where it left off (in schedule()), checks that the hardware has completed the operation, and can then return the data to the user.
So the kernel is not responsible for the driver's wait queue, as it only looks at it when the driver calls it to do so. There is no mapping from the hardware interrupt to the wait queue, for example.
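Put together, such a driver's read path and interrupt handler might look roughly like the sketch below. The device structure (extending the invented mydrv_device from the earlier sketch), the register helpers, and all names are made up; wait_event_interruptible() is a convenience wrapper around the prepare_to_wait()/schedule() loop shown earlier:

/* Hypothetical driver sketch: sleep in read() until the interrupt
 * handler reports that data is ready, then copy it to the user. */
struct mydrv_device {
        wait_queue_head_t read_wq;
        bool data_ready;
        u8 data;
};

static ssize_t mydrv_read(struct file *file, char __user *buf,
                          size_t count, loff_t *ppos)
{
        struct mydrv_device *dev = file->private_data;

        mydrv_start_hw_read(dev);       /* poke the hardware registers (made up) */

        /* Sleep on the driver's own wait queue until the interrupt
         * handler sets data_ready, or a signal arrives. */
        if (wait_event_interruptible(dev->read_wq, dev->data_ready))
                return -ERESTARTSYS;

        dev->data_ready = false;
        if (copy_to_user(buf, &dev->data, sizeof(dev->data)))
                return -EFAULT;
        return sizeof(dev->data);
}

static irqreturn_t mydrv_irq(int irq, void *cookie)
{
        struct mydrv_device *dev = cookie;

        dev->data = mydrv_read_hw_data(dev);    /* fetch the result (made up) */
        dev->data_ready = true;
        wake_up_interruptible(&dev->read_wq);   /* put sleepers back on the run queue */
        return IRQ_HANDLED;
}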
If there are several tasks on the same wait queue, there are variants of the wake_up() call that can be used to wake up only one task, or all of them, or only those that are in an interruptible wait (i.e. are designed to be able to cancel the operation and return to the user in case of a signal), and so on.
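A few of the common variants, as declared in include/linux/wait.h (a selection, not the complete list):

wake_up(&q);                    /* wake all non-exclusive waiters plus one exclusive one */
wake_up_nr(&q, nr);             /* as above, but up to nr exclusive waiters              */
wake_up_all(&q);                /* wake every waiter on the queue                        */
wake_up_interruptible(&q);      /* only wake tasks sleeping in TASK_INTERRUPTIBLE        */
wake_up_interruptible_all(&q);  /* interruptible variant of wake_up_all()                */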