
Uses of the monitor/mwait instructions

I happened to stumble upon these two instructions, mwait and monitor: https://www.felixcloutier.com/x86/mwait. The Intel manual says these are used to wait for writes in a concurrent multi-processor system, and it made me curious what types of use cases were in mind when these instructions were added to the ISA.

What are the semantics of these instructions? Is this integrated through linux into the threading libraries provided by posix (eg. does the thread yield while monitoring a word)? Or are these just fancier versions of the pause instruction? Consequently, what is the relation of these instructions on hyperthreading?

Curious asked Aug 13 '19 05:08


2 Answers

Uses of monitor/mwait in the Linux kernel

The Linux kernel uses the monitor/mwait instructions in the idle loop, which is executed on a core when there is no runnable task (other than the idle task) that is scheduled to run on the core. These instructions are used in the idle loop on all Intel x86 processors, except in the following situations:

  • The processor doesn't support the instructions. All Intel Core processors starting with the 90nm Pentium 4, all Intel Atom processors, and all Xeon Phi processors support these instructions.
  • The cpuidle subsystem is disabled (it's enabled by default, but can be disabled explicitly using the cpuidle.off=1 kernel parameter) or failed to initialize. In addition, the processor is either not from Intel or it's an Intel processor with the X86_BUG_MONITOR bug. This bug currently exists only in some Goldmont processors, where a core in a low-power C-state can only be woken up via an IPI. See: x86: add workaround monitor bug.
  • mwait is disabled in the BIOS setup on a processor that supports the instruction.
  • The idle kernel parameter is used, which takes one of the following values: poll, halt, nomwait. When this parameter is used, the intel_idle driver is not used (i.e., either the acpi_idle driver is used or the cpuidle subsystem is disabled). In the current implementation, nomwait is effectively the same as halt; both use the hlt instruction to put a core to sleep (in state C1). (By the way, there used to be a fourth option, called mwait, but it's been removed since v3.9-rc1 because it was deemed not useful. See the patches 1 and 2.)

Otherwise, these instructions are used to put any logical core in any supported C-state (other than the active state C0, of course). This is the case irrespective of whether the cpuidle subsystem is enabled (except as described above), which cpuidle driver is used, and the value of the intel_idle.max_cstate kernel parameter (which specifies whether to use the intel_idle or acpi_idle driver and what the deepest allowed C-state is).

A cpuidle driver is responsible for determining what power states can be used for each processor, the performance characteristics of each power state (e.g., exit latency, target residency, and power usage in that state), and how to enter each of these states.

When the intel_idle driver is used, the function that is called to enter a particular state on all processors supported by the driver can be found here. It basically works as follows (note that timer interrupts have already been disabled at this point):

  • When entering the C3 state or deeper, the TLB entries of the logical core are flushed, so that the core doesn't get woken up just to handle TLB shootdowns.
  • If the processor has the X86_BUG_CLFLUSH_MONITOR bug, clflush is used to flush the address range armed by the monitor instruction used to exit the sleep state. To my knowledge, the only processor that has this bug is the Intel Xeon Processor 7400 (the bug and the flush workaround are documented in the AAI65 erratum).
  • The monitor instruction is executed with ecx and edx both zero.
  • The buffers that are vulnerable to MDS attacks are flushed (if any). For more information, see this.
  • The mwait instruction is executed with eax containing the target C-state and ecx containing 1 (i.e., exit the state on an interrupt).
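The register operands in the last step can be sketched in plain C. This is only a minimal sketch, assuming the hint layout that the Intel manual documents for mwait: eax bits 7:4 hold the target C-state minus one, eax bits 3:0 hold a sub-state, and ecx bit 0 asks for a wake-up on interrupts even when they are masked. The names mwait_hint and ECX_BREAK_ON_MASKED_IRQ are made up for illustration and are not kernel identifiers. The instructions themselves need ring 0, so the code only computes the values that would be loaded into the registers.

```c
#include <assert.h>
#include <stdint.h>

/* ecx bit 0: treat interrupts as break events even when they are masked. */
#define ECX_BREAK_ON_MASKED_IRQ 0x1u

/*
 * Build the eax hint for mwait (hypothetical helper, not a kernel function).
 * cstate:   1 for C1, 2 for C2, ... as enumerated for mwait by the processor
 * substate: sub-state within that C-state, 0..15
 */
static uint32_t mwait_hint(unsigned cstate, unsigned substate)
{
    assert(cstate >= 1 && cstate <= 15);
    assert(substate <= 15);
    /* Bits 7:4 hold (C-state - 1); bits 3:0 hold the sub-state. */
    return ((uint32_t)(cstate - 1) << 4) | (uint32_t)substate;
}
```

For example, mwait_hint(1, 0) gives 0x00 and mwait_hint(3, 0) gives 0x20. Note that the C-state numbers in the hint are the ones the processor enumerates for mwait, which do not have to match the ACPI C-state numbers.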

When the intel_idle driver is not used (i.e., either acpi_idle is used or the cpuidle subsystem is disabled), the sequence is similar, except that the TLB entries of the core are not flushed. Also the target C-state in eax is always C1.

(You can use the cpupower idle-info and cpupower monitor tools to determine the C-states supported by your processor, which cpuidle driver and governor are active, and some performance and usage characteristics (per core) of each C-state.)

Another case where mwait is used is when soft-offlining a CPU. The way it's used here is similar to what I've discussed for the idle loop (see the code). A CPU is offlined by putting it in the deepest available sleep state. (One important difference though is that all dirty cache lines in the private caches of the physical core that contains the logical core being offlined must be flushed or at least written back. The reason for this (according to this thread) is that cache coherency doesn't work on the private caches if a physical core is in a C-state deeper than C1. The relevant patch can be found here.)

When waking up the system from hibernation, some of the processors may be configured as offline (e.g., when SMT is disabled, all sibling logical cores must be offline). The cores that were offline before hibernation will still be in the same sleep state when waking up the system, except for the bootstrap processor (BSP). In particular, they can still be woken up by writing to an address in the memory range armed by the monitor instruction on the respective cores. To make sure that none of these cores is woken up prematurely (before it's possible to perform address translation to fetch and execute the instruction that follows mwait), the BSP wakes up all the cores, and then puts them offline using the hlt instruction instead. This is not efficient power-wise (because hlt puts the core in C1 only), but is safe correctness-wise. Later, all cores that are supposed to be offline are woken up again and put into the deepest sleep state using mwait in a safe fashion. This is an example of why you'd want to use hlt instead of mwait, even when mwait is supported.

The AMD Excavator microarchitecture and later support a variant of mwait, called mwaitx, which can be configured with a 32-bit timer that counts at the TSC frequency and exits the sleep state when the timer expires. Currently, this instruction is only used to implement the delay APIs, including udelay and ndelay. If this instruction is not supported, the delay is implemented by spinning in a loop until the value in the TSC register has increased by about the required number of cycles. The pause instruction is similar, except that the sleep time is not configurable.
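The fallback described above (spinning until the counter has advanced far enough) can be sketched as follows. This is not the kernel's code, just a minimal sketch: the counter is passed in as a function pointer so the example runs anywhere, and delay_cycles, counter_fn and fake_counter are hypothetical names. On real hardware the counter function would read the TSC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter source; on x86 this would read the TSC. */
typedef uint64_t (*counter_fn)(void);

/*
 * Spin until the counter has advanced by at least `cycles`.
 * Returns the number of polls performed (only useful for illustration).
 */
static uint64_t delay_cycles(counter_fn read_counter, uint64_t cycles)
{
    assert(read_counter != 0);
    uint64_t start = read_counter();
    uint64_t polls = 0;
    /* Unsigned subtraction stays correct even if the counter wraps around. */
    while (read_counter() - start < cycles)
        polls++;
    return polls;
}

/* Fake counter for demonstration: advances by 10 on every read. */
static uint64_t fake_now;

static uint64_t fake_counter(void)
{
    fake_now += 10;
    return fake_now;
}
```

With the fake counter starting at zero, delay_cycles(fake_counter, 35) reads the start value 10 and stops once the counter reaches 50, so the wait overshoots slightly, which is why the delay is only "about" the requested number of cycles.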

(Modern Intel processors seem to support timed mwait as well, although I don't think this feature is officially documented by Intel for any of the current processors. Perhaps this explains why the Linux kernel doesn't use it.)

Usually, a core transitions to one of the sleep C-states only on demand, i.e., when it is idle or goes offline. It's also possible to force a CPU package to be in a package C-state for a specified percentage of the time, even if there are runnable threads that can be scheduled on cores of that package. The Intel Powerclamp driver can be used to achieve this via the monitor/mwait instructions.

These are all the uses of these instructions in the Linux kernel that I'm aware of.

Uses of monitor/mwait for thread synchronization

Starting with gcc 9 and kernel v5.3-rc1, the user-mode versions of mwait and monitor, called umwait and umonitor, are exposed through the _umwait and _umonitor intrinsics. To use these intrinsics, include the immintrin.h header and compile with -mwaitpkg. No current processor supports these instructions (the CPUID information in Tremont is correct and the current Intel documentation is wrong about this). The first microarchitecture to support these instructions will probably be Sapphire Rapids. umwait is much less powerful than mwait and its exact behavior can be controlled by the OS through the IA32_UMWAIT_CONTROL MSR. glibc currently doesn't use these instructions.

I think umwait is useful for implementing spinlocks and condition variables, where you want threads to block until the memory location that holds the lock is modified (indicating that the lock has been released). In contrast to mwait, the timer-triggered wakeup is documented for umwait. When implementing a synchronization primitive using umwait, it's important to remember that resuming execution from umwait doesn't necessarily mean that the condition a thread is waiting for is triggered. umwait may wake up due to an interrupt, an expiration of the time limit specified by umwait (which could be overridden by an OS time limit), or other implementation-dependent events. Also if umonitor failed to arm the address range of the primitive, umwait will not even change the C-state. That's why after waking up from umwait, the thread must still perform the necessary checks.
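Here is a rough sketch of such a wait loop using the _umonitor and _umwait intrinsics. It is a sketch under several assumptions, not a finished primitive: wait_until_set, WAIT_CYCLES and the max_rounds limit are made up for the example, x86intrin.h is included to get __rdtsc along with the waitpkg intrinsics, and there is no runtime CPUID check, so a binary built with -mwaitpkg would fault on a processor without the feature. When the code is built without -mwaitpkg, the waitpkg part is compiled out and the loop degrades to bounded polling.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__WAITPKG__)
#include <x86intrin.h>   /* _umonitor, _umwait, __rdtsc */
#endif

/* Hypothetical per-wait budget in TSC cycles; the OS limit may be shorter. */
#define WAIT_CYCLES 100000ull

/*
 * Wait until *word becomes nonzero, re-checking after every wake-up.
 * Gives up after max_rounds rounds; returns 1 if the word was seen
 * nonzero and 0 otherwise.
 */
static int wait_until_set(volatile uint32_t *word, unsigned max_rounds)
{
    for (unsigned round = 0; round < max_rounds; round++) {
        if (*word != 0)
            return 1;
#if defined(__WAITPKG__)
        _umonitor((void *)word);   /* arm the address range */
        if (*word != 0)            /* the store may have raced with arming */
            return 1;
        /* Control value 0 requests C0.2; wake at the TSC deadline at the latest. */
        (void)_umwait(0, __rdtsc() + WAIT_CYCLES);
#endif
        /* Any wake-up reason lands here, so the condition is checked again. */
    }
    return *word != 0;
}
```

The return value of _umwait is ignored here because the loop re-checks the word no matter why the wait ended.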

umwait currently supports only two C-states: C0.1 (called light-weight power/performance optimized state) and C0.2 (called improved power/performance optimized state). Neither of them is a sleep state; they are basically sub-states of C0. This is similar to pause/tpause, which keep the core in C0. The meaning of C0.1 and C0.2 is currently not documented. I think these sub-states save power by de-pipelining the thread, i.e., instructions are no longer fetched for that thread. They can also improve the performance of the other sibling thread because it can now use all of the competitively shared resources without contention. However, partitioned resources are not recombined (which occurs when transitioning to a deeper C-state).

umwait is essentially tpause + the "memory wait" feature of mwait + it causes a transactional abort, like pause, when executed in a transactional region. It's worth noting here that the pause latency is implementation-dependent (it could be zero), which makes it hard to use effectively. I think the only advantage of pause is that it's highly portable; it's supported on the 130nm Pentium 4 and later and it behaves like a nop on all 32-bit and 64-bit Intel and AMD processors that don't support it.

Knights Landing and Knights Mill offer a feature that allows monitor and mwait to be executed in any ring including user-mode. This can be achieved by setting MISC_FEATURE_ENABLES[1] to 1. Linux enables this feature by default on these processors. It can be disabled by passing ring3mwait=disable to the kernel command line (which makes the kernel not set MISC_FEATURE_ENABLES[1] to 1, thereby keeping it at the default 0 value). According to the docs:

If MWAIT is executed when CPL > 0 or in virtual-8086 mode, and if EAX indicates a C-state other than C0 or C1, the instruction operates as if EAX indicated the C-state C1.

Interestingly, mwait here can be used to transition to C1, but umwait can't.

I don't know if this feature on KNL/KNM is used in any program.

Some discussion on the potential of using mwait and monitor for thread synchronization can be found here and here (both of which are very old).

Execution characteristics of monitor/mwait

Both hlt and mwait can be used to enter C1. In this case, the only architectural difference between them (other than being different instructions) is that following an SMI, if auto halt restart is enabled, the saved instruction pointer points to the hlt instruction, not the instruction that follows it. So if the interrupt handler wants to return the core to the sleep state, it can just return normally without having to do anything extra. According to 34.10 of Volume 3:

If the HLT instruction is restarted, the processor will generate a memory access to fetch the HLT instruction (if it is not in the internal cache), and execute a HLT bus transaction. This behavior results in multiple HLT bus transactions for the same HLT instruction.

This also applies to AMD processors.

When a logical core enters a sleep state, all resources that are partitioned or reserved for it become available for the sibling core. At the very least, this may improve the performance of the sibling core (in contrast to using a polling loop). If the other sibling core also enters a sleep state, the whole physical core can enter a low-power state. If all physical cores of the same package enter a sleep state, the whole package (including the uncore) can enter a low-power state.

A core in a sleep state (because of executing hlt or mwait) transitions to C0 (the active state) when any of the following events occur:

  • An interrupt occurs (it doesn't have to be affine to the core).
  • An address monitored by the core (by executing monitor on a valid WB address range) is stored to.
  • The timer expires in case of a timed mwait.

You can find this information documented in the datasheets of Intel processors. Of course, there is a shitload of errata related to mwait and monitor.

Summary

╔══╦═════════════════════════════════════╦═══════════════════════╦════════════════╦═════════════════╦════════════════╦═════════════════╦══════════════════╗
║  ║                                     ║ mwait                 ║ mwaitx         ║ umwait          ║ pause          ║ tpause          ║ hlt              ║
╠══╩═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Wakeup triggers:                       ║                       ║                ║                 ║                ║                 ║                  ║
╠══╦═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ WB store from a processor agent     ║ +                     ║ +              ║ +               ║ –              ║ –               ║ –                ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ WB store from a non-processor agent ║ No guarantee          ║ No guarantee   ║ +               ║ –              ║ –               ║ –                ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ Non-WB stores                       ║ No guarantee          ║ No guarantee   ║ No guarantee    ║ –              ║ –               ║ –                ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ Unmasked interrupt                  ║ +                     ║ +              ║ +               ║ ?              ║ +               ║ +                ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ Masked interrupt                    ║ + (1)                 ║ + (1)          ║ + (1)           ║ ?              ║ +               ║ –                ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ Timer                               ║ – (2)                 ║ + (3)          ║ + (4)           ║ –              ║ + (4)           ║ –                ║
╠══╬═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║  ║ Implementation-dependent            ║ +                     ║ –              ║ +               ║ –              ║ +               ║ –                ║
╠══╩═════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ User mode                              ║ – (5)                 ║ +              ║ +               ║ +              ║ +               ║ –                ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Wakeup IP                              ║ Next                  ║ Next           ║ Next            ║ Next           ║ Next            ║ Next or same (6) ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Deepest C-state                        ║ Deepest supported (7) ║ C1             ║ C0.2 (8)        ║ C0 (9)         ║ C0.2 (8)        ║ C1               ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Doesn't abort transaction              ║ +                     ║ N/A            ║ +               ║ –              ║ +               ║ –                ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Real mode                              ║ –                     ║ –              ║ –               ║ –              ║ –               ║ +                ║
╠════════════════════════════════════════╬═══════════════════════╬════════════════╬═════════════════╬════════════════╬═════════════════╬══════════════════╣
║ Support                                ║ 90nm P4+              ║ AMD Excavator+ ║ Atom Tremont,   ║ 130nm P4+ (10) ║ Atom Tremont,   ║ All x86          ║
║                                        ║                       ║                ║ Alder Lake,     ║                ║ Alder Lake,     ║                  ║
║                                        ║                       ║                ║ Sapphire Rapids ║                ║ Sapphire Rapids ║                  ║
╚════════════════════════════════════════╩═══════════════════════╩════════════════╩═════════════════╩════════════════╩═════════════════╩══════════════════╝

(This ASCII art was generated using TablesGenerator.com.)

Notes:
(1) This behavior is configurable via the ecx parameter.
(2) It actually does support a timer, at least on recent microarchitectures. However, this feature is not documented.
(3) Wait time is stored in a 32-bit field, in contrast to umwait and tpause where it's stored in a 64-bit field.
(4) A maximum wait time may be specified in IA32_UMWAIT_CONTROL.
(5) On KNL and KNM, setting MISC_FEATURE_ENABLES[1] to 1 allows the instruction to be executed in user mode.
(6) The hlt instruction is re-executed following an SMI if auto halt restart is enabled.
(7) On KNL and KNM, if MISC_FEATURE_ENABLES[1] is 1, the deepest C-state is C1.
(8) If IA32_UMWAIT_CONTROL[0] is 1, the deepest C-state is C0.1.
(9) According to my understanding.
(10) Behaves as nop on all 32-bit and 64-bit Intel and AMD processors that don't support it.
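Notes (4) and (8) both refer to IA32_UMWAIT_CONTROL. As a small illustration, the two fields can be decoded as below. This is a sketch that assumes the layout as I understand it from the Intel manual: bit 0 disables C0.2, bit 1 is reserved, and bits 31:2 hold the upper bits of the maximum wait time in TSC quanta (with zero meaning no limit). The struct and function names are hypothetical, and actually reading the MSR requires ring 0, so the function just takes the low 32 bits of the value as an argument.

```c
#include <assert.h>
#include <stdint.h>

struct umwait_control {
    int      c02_disabled;   /* bit 0 set: C0.2 not allowed, deepest state is C0.1 */
    uint32_t max_time;       /* OS limit in TSC quanta; 0 means no limit */
};

/* Decode the low 32 bits of IA32_UMWAIT_CONTROL (hypothetical helper). */
static struct umwait_control decode_umwait_control(uint32_t msr_low)
{
    struct umwait_control c;
    c.c02_disabled = (int)(msr_low & 0x1u);
    /* Bits 31:2 give the upper bits of the limit; the low two bits count as zero. */
    c.max_time = msr_low & ~0x3u;
    return c;
}
```

For example, a value of 100001 decodes to C0.2 disabled with a limit of 100000 TSC quanta.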

Hadi Brais answered Nov 04 '22 15:11


What are the semantics of these instructions?

The general idea is that instead of having a polling loop (e.g. "while( *foo == 0) {}") you set up the monitor (using monitor) then check the condition, then (if the condition hasn't happened) wait for the monitor to be triggered (using mwait). This allows the CPU to consume less power (and/or lets a different logical processor in the same core run better) while waiting for the condition to change.

However, there can be false positives (writes to something else in the same cache line) and other things (IRQs) that cause mwait to stop waiting. For that reason you still need to check the condition in a loop, and because a wake-up disarms the monitor, it should be armed again before each wait. So the whole thing ends up like (e.g.) "while(*foo == 0) { monitor(foo); if(*foo == 0) mwait(); }".
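To make the effect of false wake-ups concrete, here is a small simulation of that loop. The real instructions need CPL=0, so monitor_stub and mwait_stub are hypothetical stand-ins: the mwait stand-in returns a few times without the condition having changed and only sets the flag on its last return. Everything here (names, the count of three wake-ups) is made up for the example.

```c
#include <assert.h>

/* Stand-ins for the privileged instructions (hypothetical; real code needs ring 0). */
static int wakeups_left = 3;   /* wake-ups to simulate before the "store" happens */
static volatile int flag;      /* the monitored word */

static void monitor_stub(const volatile void *addr)
{
    (void)addr;                /* a real monitor would arm the cache line here */
}

static void mwait_stub(void)
{
    /* Every return models a wake-up; only the last one is caused by the store. */
    if (--wakeups_left == 0)
        flag = 1;
}

/* Arm, check, wait, and check again; returns how many times we woke up. */
static int wait_for_flag(void)
{
    int wakeups = 0;
    while (flag == 0) {
        monitor_stub(&flag);   /* arm again: a wake-up disarms the monitor */
        if (flag != 0)         /* the store may have landed before arming */
            break;
        mwait_stub();
        wakeups++;
    }
    return wakeups;
}
```

The first call to wait_for_flag() returns 3 here: two wake-ups find the flag still zero and go back to waiting, and the third one finds it set.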

Is this integrated through linux into the threading libraries provided by posix (eg. does the thread yield while monitoring a word)?

These instructions typically can't be used in user-space (require CPL=0). Note: There was a proposed extension to allow (a version of) monitor/mwait to be used in user-space, but I'm not sure if it ever got implemented (yet?).

However, they are often used in the kernel's scheduler when there are no tasks that want the CPU (to monitor an empty list of tasks that want the CPU and wake the CPU up when a task gets added to the list). In that way, it could end up being used by higher-level user-space things (e.g. pthread_condvars).

Note: Ages ago (maybe about 5 years?) I remember seeing some research into using monitor/mwait for spinlocks (in kernel); where the conclusion was that it took too long for the CPU to wake up and wasn't worth doing. I'm not sure if anything has changed since.

Or are these just fancier versions of the pause instruction?

The pause instruction is very different: it tells the CPU not to aggressively (speculatively) execute future instructions (and doesn't tell the CPU to wait or to stop executing instructions). It's also useful in polling loops, but for different reasons.

Consequently, what is the relation of these instructions on hyperthreading?

If one logical CPU in a core is doing nothing (e.g. mwait, hlt) then the other logical CPU in the core can use the whole core to execute stuff faster.

If one logical CPU in a core is doing less (because pause told the CPU not to be so aggressive with speculative execution) then the other logical CPU in the core can use more of the core to execute stuff a little faster.

Brendan answered Nov 04 '22 15:11