I happened to stumble upon these two instructions, mwait and monitor: https://www.felixcloutier.com/x86/mwait. The Intel manual says these are used to wait for writes in a concurrent multi-processor system, and it made me curious what types of use cases were in mind when these instructions were added to the ISA.
What are the semantics of these instructions? Is this integrated through Linux into the threading libraries provided by POSIX (e.g., does the thread yield while monitoring a word)? Or are these just fancier versions of the pause instruction? Consequently, what is the relation of these instructions to hyperthreading?
monitor/mwait in the Linux kernel

The Linux kernel uses the monitor/mwait instructions in the idle loop, which is executed on a core when there is no runnable task (other than the idle task) that is scheduled to run on the core. These instructions are used in the idle loop on all Intel x86 processors, except in the following situations:
- The cpuidle subsystem is disabled (e.g., using the cpuidle.off=1 kernel parameter) or failed to initialize. In addition, the processor is either not from Intel or it's an Intel processor with the X86_BUG_MONITOR bug. This bug currently exists only in some Goldmont processors, where a core in a low-power C-state can only be woken up via an IPI. See: x86: add workaround monitor bug.
- mwait is disabled in the BIOS setup on a processor that supports the instruction.
- The idle kernel parameter is used, which takes one of the following values: poll, halt, nomwait. When this parameter is used, the intel_idle driver is not used (i.e., either the acpi_idle driver is used or the cpuidle subsystem is disabled). In the current implementation, nomwait is effectively the same as halt; both use the hlt instruction to put a core to sleep (in state C1). (By the way, there used to be a fourth option, called mwait, but it's been removed since v3.9-rc1 because it was deemed not useful. See the patches 1 and 2.)

Otherwise, these instructions are used to put any logical core in any supported C-state (other than the active state C0, of course). This is the case irrespective of whether the cpuidle subsystem is enabled (except as described above), which cpuidle driver is used, and the value of the intel_idle.max_cstate kernel parameter (which specifies whether to use the intel_idle or acpi_idle driver and what the deepest allowed C-state is).
A cpuidle driver is responsible for determining what power states can be used for each processor, the performance characteristics of each power state (e.g., exit latency, target residency, and power usage in that state), and how to enter each of these states.
When the intel_idle driver is used, the function that is called to enter a particular state on all processors supported by the driver can be found here. It basically works as follows (note that timer interrupts have already been disabled at this point):
- On a processor with the X86_BUG_CLFLUSH_MONITOR bug, clflush is used to flush the address range armed by the monitor instruction used to exit the sleep state. To my knowledge, the only processor that has this bug is the Intel Xeon Processor 7400 (the bug and the flush workaround are documented in the AAI65 erratum).
- The monitor instruction is executed with ecx and edx both zero.
- The mwait instruction is executed with eax containing the target C-state and ecx containing 1 (i.e., exit the state on an interrupt).

When the intel_idle driver is not used (i.e., either acpi_idle is used or the cpuidle subsystem is disabled), the sequence is similar, except that the TLB entries of the core are not flushed. Also, the target C-state in eax is always C1.
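The monitor/mwait step of the sequence above can be sketched in C roughly as follows. This is not the kernel's code: the names mwait_hint, cpu_monitor, cpu_mwait and idle_enter_sketch are made up for illustration, and the hint layout (bits 7:4 hold the MWAIT C-state number minus one, bits 3:0 a sub-state) is my reading of the Intel SDM. Note that the MWAIT C-state numbers do not always match the marketing names of the core C-states. The two wrappers execute privileged instructions, so they only make sense at CPL 0 and are shown here without being called.

```c
#include <assert.h>
#include <stdint.h>

/* ECX bit 0: treat interrupts as break events even when they are masked. */
#define MWAIT_ECX_INTERRUPT_BREAK 0x1u

/* Build the EAX hint: bits 7:4 = (MWAIT C-state - 1), bits 3:0 = sub-state.
 * Assumed layout, taken from the SDM description of MWAIT. */
static uint32_t mwait_hint(unsigned cstate, unsigned substate)
{
    assert(cstate >= 1 && cstate <= 15 && substate <= 15);
    return ((cstate - 1u) << 4) | substate;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Arm the monitor on the cache line containing addr (ecx = edx = 0). */
static inline void cpu_monitor(const volatile void *addr)
{
    __asm__ volatile("monitor" : : "a"(addr), "c"(0u), "d"(0u) : "memory");
}

/* Wait for a store to the armed range, an interrupt, or another wake event. */
static inline void cpu_mwait(uint32_t eax, uint32_t ecx)
{
    __asm__ volatile("mwait" : : "a"(eax), "c"(ecx) : "memory");
}

/* Hypothetical idle entry: faults with #UD/#GP if run outside ring 0. */
static inline void idle_enter_sketch(const volatile void *wake_word,
                                     unsigned cstate)
{
    cpu_monitor(wake_word);
    cpu_mwait(mwait_hint(cstate, 0), MWAIT_ECX_INTERRUPT_BREAK);
}
#endif
```

With this encoding, a request for C1 yields a hint of 0x00, which is the value that would be used when the target state is always C1.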
(You can use the cpupower idle-info and cpupower monitor tools to determine the C-states supported by your processor, which cpuidle driver and governor are active, and some performance and usage characteristics (per core) of each C-state.)
Another case where mwait is used is when soft-offlining a CPU. The way it's used here is similar to what I've discussed for the idle loop (see the code). A CPU is offlined by putting it in the deepest available sleep state. (One important difference, though, is that all dirty cache lines in the private caches of the physical core that contains the logical core being offlined must be flushed or at least written back. The reason for this (according to this thread) is that cache coherency doesn't work on the private caches if the physical core is in a C-state deeper than C1. The relevant patch can be found here.)
When waking up the system from hibernation, some of the processors may be configured as offline (e.g., when SMT is disabled, all sibling logical cores must be offline). The cores that were offline before hibernation will still be in the same sleep state when waking up the system, except for the bootstrap processor (BSP). In particular, they can still be woken up by writing to an address in the memory range armed by the monitor instruction on the respective cores. To make sure that none of these cores are woken up prematurely (before it's possible to perform address translation to fetch and execute the instruction that follows mwait), the BSP wakes up all the cores, and then puts them offline using the hlt instruction instead. This is not efficient power-wise (because hlt puts the core in C1 only), but is safe correctness-wise. Later, all cores that are supposed to be offline are woken up again and put into the deepest sleep state using mwait in a safe fashion. This is an example of why you'd want to use hlt instead of mwait, even when mwait is supported.
The AMD Excavator microarchitecture and later support a variant of mwait, called mwaitx, which can be configured with a 32-bit timer that counts at the TSC frequency and exits the sleep state when the timer expires. Currently, this instruction is only used to implement the delay APIs, including udelay and ndelay. If this instruction is not supported, the delay is implemented by spinning in a loop until the value in the TSC register has increased by about the required number of cycles. The pause instruction is similar, except that the sleep time is not configurable.
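The TSC-based fallback described above boils down to a loop like the one below. This is only a minimal user-space sketch, not the kernel's delay code: delay_ticks, read_counter and cpu_relax are names I made up, and on non-x86 builds a wall-clock nanosecond counter stands in for the TSC just so the example still compiles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter. */
static inline uint64_t read_counter(void) { return __rdtsc(); }
/* pause hints to the core that this is a spin-wait loop. */
static inline void cpu_relax(void) { __builtin_ia32_pause(); }
#else
#include <time.h>
/* Stand-in for non-x86 builds: nanoseconds from the C11 wall clock. */
static inline uint64_t read_counter(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
static inline void cpu_relax(void) { }
#endif

/* Spin until the counter has advanced by at least `ticks`.
 * Returns the number of ticks that actually elapsed. */
static uint64_t delay_ticks(uint64_t ticks)
{
    uint64_t start = read_counter();
    uint64_t now;

    /* Unsigned subtraction keeps the comparison valid across wraparound. */
    while ((now = read_counter()) - start < ticks)
        cpu_relax();
    return now - start;
}
```

A call such as delay_ticks(10000) returns once at least 10000 counter ticks have passed. Unlike mwaitx, the core stays busy for the whole delay, which is the cost the timed wait is meant to avoid.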
(Modern Intel processors seem to support timed mwait as well, although I don't think this feature is officially documented by Intel for any of the current processors. Perhaps this explains why the Linux kernel doesn't use it.)
Usually, a core transitions to one of the sleep C-states only on demand, i.e., when it goes offline. It's possible to force a CPU package to be in the package C-state for a specific percentage of time, even if there are runnable threads that can be scheduled on cores of that package. The Intel Powerclamp driver can be used to achieve this via the monitor/mwait instructions.
These are all the uses of these instructions in the Linux kernel that I'm aware of.
monitor/mwait for thread synchronization

Starting with gcc 9 and kernel v5.3-rc1, the user-mode versions of mwait and monitor, called umwait and umonitor, are exposed through the _umwait and _umonitor intrinsics. To use these intrinsics, include the immintrin.h header and compile with -mwaitpkg. No current processor supports these instructions (the CPUID information in Tremont is correct and the current Intel documentation is wrong about this). The first microarchitecture to support these instructions will probably be Sapphire Rapids. umwait is much less powerful than mwait, and its exact behavior can be controlled by the OS through the IA32_UMWAIT_CONTROL MSR. glibc currently doesn't use these instructions.
I think umwait is useful for implementing spinlocks and condition variables, where you want threads to block until the memory location that holds the lock is modified (indicating that the lock has been released). In contrast to mwait, the timer-triggered wakeup is documented for umwait. When implementing a synchronization primitive using umwait, it's important to remember that resuming execution from umwait doesn't necessarily mean that the condition a thread is waiting for has been triggered. umwait may wake up due to an interrupt, an expiration of the time limit specified by umwait (which could be overridden by an OS time limit), or other implementation-dependent events. Also, if umonitor failed to arm the address range of the primitive, umwait will not even change the C-state. That's why, after waking up from umwait, the thread must still perform the necessary checks.
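Here is a rough sketch of what such a wait loop could look like with the _umonitor and _umwait intrinsics. The helper names wait_for_nonzero and have_waitpkg are hypothetical, and the deadline of 100000 TSC ticks is an arbitrary choice. The waitpkg path is only compiled with -mwaitpkg and only taken when CPUID reports the feature (CPUID.(EAX=7,ECX=0):ECX[5], as I understand it); otherwise the loop degrades to plain polling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__WAITPKG__)
#include <cpuid.h>
#include <immintrin.h>
#include <x86intrin.h>

/* Runtime check so the intrinsics are never executed on a CPU without them. */
static int have_waitpkg(void)
{
    unsigned int a, b, c, d;
    return __get_cpuid_count(7, 0, &a, &b, &c, &d) && ((c >> 5) & 1u);
}
#endif

/* Hypothetical helper: wait until *word becomes nonzero and return its value. */
static uint32_t wait_for_nonzero(_Atomic uint32_t *word)
{
    uint32_t v;

    assert(word != NULL);
    /* The condition is re-checked after every wakeup, whatever caused it. */
    while ((v = atomic_load_explicit(word, memory_order_acquire)) == 0) {
#if defined(__WAITPKG__)
        if (have_waitpkg()) {
            _umonitor((void *)word);
            /* Check again after arming so a store in between is not missed. */
            if (atomic_load_explicit(word, memory_order_acquire) != 0)
                continue;
            /* Control bit 0 = 1 requests C0.1; the second argument is an
             * absolute TSC deadline. The return value is ignored here. */
            (void)_umwait(1u, __rdtsc() + 100000u);
        }
#endif
    }
    return v;
}
```

A waiter would call wait_for_nonzero(&flag) while another thread stores a nonzero value to flag. Re-arming the monitor on every iteration costs a little, but it keeps the loop correct for spurious wakeups and for the case where umonitor fails to arm the range.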
umwait currently supports only two C-states: C0.1 (called the light-weight power/performance optimized state) and C0.2 (called the improved power/performance optimized state). Neither of them is a sleep state; they are basically sub-states of C0. This is similar to pause/tpause, which keep the core in C0. The meaning of C0.1 and C0.2 is currently not documented. I think these sub-states save power by de-pipelining the thread, i.e., instructions are no longer fetched for that thread. They can also improve the performance of the other sibling thread because it can now use all of the competitively shared resources without contention. However, partitioned resources are not recombined (which occurs when transitioning to a deeper C-state).
umwait is essentially tpause + the "memory wait" feature of mwait + it causes a transactional abort, like pause, when executed in a transactional region. It's worth noting here that the pause latency is implementation-dependent (it could be zero), which makes it hard to use effectively. I think the only advantage of pause is that it's highly portable; it's supported on the 130nm Pentium 4 and later and it behaves like a nop on all 32-bit and 64-bit Intel and AMD processors that don't support it.
Knights Landing and Knights Mill offer a feature that allows monitor and mwait to be executed in any ring, including user mode. This can be achieved by setting MISC_FEATURE_ENABLES[1] to 1. Linux enables this feature by default on these processors. It can be disabled by passing ring3mwait=disable on the kernel command line (which makes the kernel not set MISC_FEATURE_ENABLES[1] to 1, thereby keeping it at the default value of 0). According to the docs:
If MWAIT is executed when CPL > 0 or in virtual-8086 mode, and if EAX indicates a C-state other than C0 or C1, the instruction operates as if EAX indicated the C-state C1.
Interestingly, mwait here can be used to transition to C1, but umwait can't.
I don't know if this feature on KNL/KNM is used in any program.
Some discussion on the potential of using mwait and monitor for thread synchronization can be found here and here (both of which are very old).
monitor/mwait

Both hlt and mwait can be used to enter C1. In this case, the only architectural difference between them (other than being different instructions) is that following an SMI, if auto halt restart is enabled, the saved instruction pointer points to the hlt instruction, not the instruction that follows it. So if the interrupt handler wants to return the core to the sleep state, it can just return normally without having to do anything extra. According to 34.10 of Volume 3:
If the HLT instruction is restarted, the processor will generate a memory access to fetch the HLT instruction (if it is not in the internal cache), and execute a HLT bus transaction. This behavior results in multiple HLT bus transactions for the same HLT instruction.
This also applies to AMD processors.
When a logical core enters a sleep state, all resources that are partitioned or reserved for it become available for the sibling core. At the very least, this may improve the performance of the sibling core (in contrast to using a polling loop). If the other sibling core also enters a sleep state, the whole physical core can enter a low-power state. If all physical cores of the same package enter a sleep state, the whole package (including the uncore) can enter a low-power state.
A core in a sleep state (because of executing hlt or mwait) transitions to C0 (the active state) when any of the following events occur:
- monitor on a valid WB address range) is stored to.
- mwait.

You can find this information documented in the datasheets of Intel processors. Of course, there is a shitload of errata related to mwait and monitor.
╔══╦═════════════════════════════════════╦═══════════════════════╦════════════════╦═════════════════╦════════════════╦═════════════════╦══════════════════╗
║ ║ ║ mwait ║ mwaitx ║ umwait ║ pause ║ tpause ║ hlt ║
╠══╩═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Wakeup triggers: ║ ║ ║ ║ ║ ║ ║
╠══╦═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ WB store from a processor agent ║ + ║ + ║ + ║ – ║ – ║ – ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ WB store from a non-processor agent ║ No guarantee ║ No guarantee ║ + ║ – ║ – ║ – ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ Non-WB stores ║ No guarantee ║ No guarantee ║ No guarantee ║ – ║ – ║ – ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ Unmasked interrupt ║ + ║ + ║ + ║ ? ║ + ║ + ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ Masked interrupt ║ + (1) ║ + (1) ║ + (1) ║ ? ║ + ║ – ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ Timer ║ – (2) ║ + (3) ║ + (4) ║ – ║ + (4) ║ – ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ ║ Implementation-dependent ║ + ║ – ║ + ║ – ║ + ║ – ║
╠══╩═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ User mode ║ – (5) ║ + ║ + ║ + ║ + ║ – ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Wakeup IP ║ Next ║ Next ║ Next ║ Next ║ Next ║ Next or same (6) ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Deepest C-state ║ Deepest supported (7) ║ C1 ║ C0.2 (8) ║ C0 (9) ║ C0.2 (8) ║ C1 ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Doesn't abort transaction ║ + ║ N/A ║ + ║ – ║ + ║ – ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Real mode ║ – ║ – ║ – ║ – ║ – ║ + ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Support ║ 90nm P4+ ║ AMD Excavator+ ║ Atom Tremont, ║ 130nm P4+ (10) ║ Atom Tremont, ║ All x86 ║
║ ║ ║ ║ Alder Lake, ║ ║ Alder Lake, ║ ║
║ ║ ║ ║ Sapphire Rapids ║ ║ Sapphire Rapids ║ ║
╚════════════════════════════════════════╩═══════════════════════╩════════════════╩═════════════════╩════════════════╩═════════════════╩══════════════════╝
(This ASCII art was generated using TablesGenerator.com.)
Notes:
(1) This behavior is configurable via the ecx parameter.
(2) It actually does support a timer, at least on recent microarchitectures. However, this feature is not documented.
(3) Wait time is stored in a 32-bit field, in contrast to umwait and tpause, where it's stored in a 64-bit field.
(4) A maximum wait time may be specified in IA32_UMWAIT_CONTROL.
(5) On KNL and KNM, setting MISC_FEATURE_ENABLES[1] to 1 allows the instruction to be executed in user mode.
(6) The hlt instruction is re-executed following an SMI if auto halt restart is enabled.
(7) On KNL and KNM, if MISC_FEATURE_ENABLES[1] is 1, the deepest C-state is C1.
(8) If IA32_UMWAIT_CONTROL[0] is 1, the deepest C-state is C0.1.
(9) According to my understanding.
(10) Behaves as nop on all 32-bit and 64-bit Intel and AMD processors that don't support it.
What are the semantics of these instructions?
The general idea is that instead of having a polling loop (e.g. "while (*foo == 0) {}") you set up the monitor (using monitor), then check the condition, then (if the condition hasn't happened) wait for the monitor to be triggered (using mwait). This allows the CPU to consume less power (and/or lets a different logical processor in the same core run better) while waiting for the condition to change.
However, there can be false positives (writes to something else in the same cache line) and other things (IRQs) that cause mwait to stop waiting. For that reason you still need to check the condition in a loop, re-arming the monitor before each wait; so the whole thing ends up like (e.g.) "while (*foo == 0) { monitor(foo); if (*foo == 0) mwait(); }".
Is this integrated through linux into the threading libraries provided by posix (eg. does the thread yield while monitoring a word)?
These instructions typically can't be used in user-space (require CPL=0). Note: There was a proposed extension to allow (a version of) monitor/mwait to be used in user-space, but I'm not sure if it ever got implemented (yet?).
However, they are often used in the kernel's scheduler when there are no tasks that want the CPU (to monitor an empty list of tasks that want the CPU and wake the CPU up when a task gets added to the list). In that way, they could end up being used indirectly by higher-level user-space things (e.g. pthread condition variables).
Note: Ages ago (maybe about 5 years?) I remember seeing some research into using monitor/mwait for spinlocks (in kernel), where the conclusion was that it took too long for the CPU to wake up and wasn't worth doing. I'm not sure if anything has changed since.
Or are these just fancier versions of the pause instruction?
The pause instruction is very different: it tells the CPU not to aggressively (speculatively) execute future instructions, and it doesn't tell the CPU to wait or to stop executing instructions. It's also useful in polling loops, but for different reasons.
Consequently, what is the relation of these instructions on hyperthreading?
If one logical CPU in a core is doing nothing (e.g. mwait, hlt) then the other logical CPU in the core can use the whole core to execute stuff faster.
If one logical CPU in a core is doing less (because pause told the CPU not to be so aggressive with speculative execution) then the other logical CPU in the core can use more of the core to execute stuff a little faster.