I am wondering how sleep/nanosleep is implemented internally. Consider this code:
{   // on a thread other than the main() thread
    while (1)
    {
        // do something
        sleep(1);
    }
}
Would the CPU be constantly context switching to check whether the 1-second sleep is done (i.e. an internal busy wait)?
I doubt it works this way, too much inefficiency. But then how does it work?
Same question applies to nanosleep.
Note: If this is implementation/OS specific, how could I implement a more efficient scheme that doesn't lead to constant context switching?
The typical way to implement sleep() and nanosleep() is to convert the argument into whatever scale the OS's scheduler uses (while rounding up) and add the current time to it to form an "absolute wake up time"; then tell the scheduler not to give the thread CPU time until after that "absolute wake up time" has been reached. No busy waiting is involved.
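As a rough illustration of that conversion (not taken from any real kernel), the sketch below assumes a hypothetical 1 ms scheduler tick, TICK_NS, and hypothetical helper names. It rounds the requested delay up to whole ticks, adds the current tick count to get the absolute wake-up time, and shows the cheap comparison a scheduler could make instead of busy waiting. It ignores the partially elapsed current tick, which a real implementation would have to account for.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tick length: 1 ms. Real kernels derive this from the timer hardware. */
#define TICK_NS 1000000ULL

/* Convert a delay in nanoseconds to ticks, rounding up so the sleep is never short. */
static uint64_t ns_to_ticks_round_up(uint64_t delay_ns)
{
    assert(delay_ns <= UINT64_MAX - (TICK_NS - 1)); /* guard the addition below */
    return (delay_ns + TICK_NS - 1) / TICK_NS;
}

/* The "absolute wake up time": current time plus the rounded-up delay. */
static uint64_t wake_up_tick(uint64_t now_ticks, uint64_t delay_ns)
{
    return now_ticks + ns_to_ticks_round_up(delay_ns);
}

/* Scheduler-side check: the thread gets no CPU time until this returns nonzero. */
static int is_due(uint64_t now_ticks, uint64_t wake_tick)
{
    return now_ticks >= wake_tick;
}
```

The sleeping thread itself runs no code while it waits; only the scheduler's timer handling ever evaluates something like is_due().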
Note that whatever scale the OS's scheduler uses typically depends on what hardware is available and/or being used for time keeping. It can be smaller than a nanosecond (e.g. local APIC on 80x86 being used in "TSC deadline mode") or as large as 100 ms.
Also note that the OS guarantees that the delay won't be less than what you ask for; but there's typically no guarantee that it won't be longer and in some cases (e.g. low priority thread on a heavily loaded system) the delay can be much much larger than requested. For example, if you ask to sleep for 123 nanoseconds then you might sleep for 2 ms before the scheduler decides it can give you CPU time, and then it might be another 500 ms before the scheduler actually does give you CPU time (e.g. because other threads are using the CPU).
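To see the "at least, but possibly much longer" behaviour on your own system, you could time a sleep yourself. The sketch below is one way to do that, assuming a POSIX system; measured_sleep_ns and timespec_diff_ns are hypothetical helper names, and using CLOCK_MONOTONIC for the measurement is my choice rather than anything the text above prescribes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000L

/* Difference end - start in nanoseconds. */
static int64_t timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * NS_PER_SEC
         + (end.tv_nsec - start.tv_nsec);
}

/* Sleep for request_ns and return how long the call really took,
 * or -1 if the clock failed or the sleep was cut short (e.g. by a signal). */
static int64_t measured_sleep_ns(long request_ns)
{
    assert(request_ns >= 0);
    struct timespec req = {
        .tv_sec  = request_ns / NS_PER_SEC,
        .tv_nsec = request_ns % NS_PER_SEC
    };
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1;
    if (nanosleep(&req, NULL) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1;
    return timespec_diff_ns(start, end);
}
```

Subtracting the request from the return value gives the overshoot, which will vary with timer resolution and system load.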
Some OSs may try to reduce this "slept much longer than requested" problem, and some OSs (e.g. designed for hard real-time) may provide some sort of guarantee (with restrictions, e.g. subject to thread priority) for the maximum time between delay expiry and getting the CPU back. To do this, the OS/kernel would convert the argument into whatever scale the OS's scheduler uses (rounding down instead of up) and may subtract a tiny amount "just in case", so that the scheduler wakes the thread up just before the requested delay expires (and not after). Then, when the thread is given CPU time (after the cost of the context switch to the thread, and possibly after pre-fetching various cache lines the thread is guaranteed to use), the kernel would busy wait briefly until the delay has actually expired. This allows the kernel to pass control back to the thread extremely close to delay expiry.
For example, if you ask to sleep for 123 nanoseconds, then the scheduler might not give you CPU time for 100 nanoseconds, then it might spend 10 nanoseconds switching to your thread, then it might busy wait for the remaining 13 nanoseconds. Even in this case (where busy waiting is done) it normally won't busy wait for the full duration of the delay. However, if the delay is extremely short the kernel would only do the final busy waiting.
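The same "wake up slightly early, then spin for the remainder" idea can be imitated in user space. The sketch below is only an analogue of what a kernel might do, not how any particular kernel does it. It assumes a system that provides clock_nanosleep() with TIMER_ABSTIME (Linux does; some other systems do not), and hybrid_sleep_ns, now_ns and the slack_ns parameter are hypothetical names chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000LL

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (int64_t)t.tv_sec * NS_PER_SEC + t.tv_nsec;
}

/* Block until slack_ns before the deadline, then busy wait for the rest.
 * Returns how far past the deadline we were when the spin ended. */
static int64_t hybrid_sleep_ns(int64_t delay_ns, int64_t slack_ns)
{
    assert(delay_ns >= 0 && slack_ns >= 0);
    int64_t deadline = now_ns() + delay_ns;

    if (delay_ns > slack_ns) {
        /* Long enough to be worth blocking: sleep to just before the deadline. */
        int64_t early = deadline - slack_ns;
        struct timespec ts = {
            .tv_sec  = early / NS_PER_SEC,
            .tv_nsec = early % NS_PER_SEC
        };
        /* clock_nanosleep returns the error number directly; retry on signals. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL) == EINTR) {
        }
    }

    /* Very short delays skip the block above and only do this final spin. */
    while (now_ns() < deadline) {
    }
    return now_ns() - deadline;
}
```

The slack trades CPU time for precision: a larger slack_ns burns more cycles spinning but leaves more room for scheduling latency before the deadline.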
Finally, there is a special case that may be worth mentioning. On POSIX systems sleep(0) is typically abused as a yield(). I'm not too sure how legitimate this practice is - it's impossible for a scheduler to support something like yield() unless that scheduler is willing to waste CPU time doing unimportant work while more important work waits.
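For what it's worth, POSIX does define an explicit sched_yield() in <sched.h>, so code that really wants to yield need not lean on sleep(0). The sketch below polls a flag and yields between checks; wait_for_flag and max_yields are hypothetical names, and whether yielding helps at all depends on the scheduler, for the reasons given above.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Poll *flag, yielding the CPU between checks, for at most max_yields rounds.
 * Returns 1 if the flag was seen set, 0 if we gave up. */
static int wait_for_flag(atomic_int *flag, int max_yields)
{
    assert(flag != NULL && max_yields >= 0);
    for (int i = 0; i < max_yields; i++) {
        if (atomic_load(flag))
            return 1;
        sched_yield(); /* let another runnable thread go first, if the scheduler agrees */
    }
    return atomic_load(flag) != 0;
}
```

A blocking primitive such as a condition variable is usually the better tool when another thread can signal the event.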
The POSIX specifications of sleep and nanosleep say (emphasis mine):
The sleep() function shall cause the calling thread to be suspended from execution until either the number of realtime seconds specified by the argument seconds has elapsed or a signal is delivered to the calling thread and its action is to invoke a signal-catching function or to terminate the process. The suspension time may be longer than requested due to the scheduling of other activity by the system.
(Source: http://pubs.opengroup.org/onlinepubs/9699919799/functions/sleep.html.)
and
The nanosleep() function shall cause the current thread to be suspended from execution until either the time interval specified by the rqtp argument has elapsed or a signal is delivered to the calling thread, and its action is to invoke a signal-catching function or to terminate the process. The suspension time may be longer than requested because the argument value is rounded up to an integer multiple of the sleep resolution or because of the scheduling of other activity by the system. But, except for the case of being interrupted by a signal, the suspension time shall not be less than the time specified by rqtp, as measured by the system clock CLOCK_REALTIME.
(Source: http://pubs.opengroup.org/onlinepubs/9699919799/functions/nanosleep.html.)
I read that to say that a POSIX-compliant system cannot use a busy loop for sleep or nanosleep. The calling thread needs to be suspended from execution.
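One practical consequence of the signal clause in the quoted text is that nanosleep() can return early. A common pattern, sketched here with a hypothetical helper name sleep_fully, is to pass a second timespec to receive the remaining time and retry when errno is EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep for the whole of req even if signal handlers interrupt the call.
 * Returns 0 on success, -1 on any error other than EINTR (e.g. EINVAL). */
static int sleep_fully(struct timespec req)
{
    struct timespec rem;

    assert(req.tv_sec >= 0);
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem; /* interrupted: continue with the unslept remainder */
    }
    return 0;
}
```

Because each retry can round up again, repeated interruptions may make the total sleep drift longer than requested; sleeping until an absolute time avoids that where it is available.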