From what I understand, on processors that don't have hardware support for guest-virtual to host-physical address translation, KVM uses a shadow page table.
The shadow page table is built and updated when the guest OS modifies its page tables. Are there special instructions in the hardware (let’s take x86 for reference) for modifying the page table? Unless there are special instructions, there won't be a trap to the VMM. Isn't the page table maintained in software by the Linux kernel just another data structure? Why would it need special instructions to update it?
Thanks!
I work with a VMM other than KVM, so I don't know the details of KVM, but the principle is the same for all VMMs. The way it works is that there are two sets of page-tables.
There are no special instructions to manage page-tables aside from the special register for the page-table base address [and some random bits in other registers to do with configuring the processor in general, but that's typically a "one off" setup]. Page tables are just bits of memory that are written to with regular instructions - you can do add, subtract, and, or, multiply etc, if you really want [it'll most likely cause problems unless you absolutely know what you are doing!], but the typical operation is a "mov" (store) or an "xchg" (exchange).
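To make the "just bits of memory" point concrete, here is a toy model in Python, not real kernel code. The bit positions follow the x86-64 page-table entry layout (present, writable, user, and the frame address in bits 12..51); the names make_pte and page_table are made up for this illustration.

```python
# Toy model: an x86-64 page-table entry is just a 64-bit integer in memory.
PTE_PRESENT = 1 << 0   # bit 0: entry is valid
PTE_WRITABLE = 1 << 1  # bit 1: writes allowed
PTE_USER = 1 << 2      # bit 2: user-mode access allowed
ADDR_MASK = 0x000F_FFFF_FFFF_F000  # bits 12..51: physical frame address


def make_pte(phys_addr, flags):
    """Build an entry from a 4 KiB-aligned physical address and flag bits."""
    assert phys_addr & 0xFFF == 0, "frame address must be page-aligned"
    return (phys_addr & ADDR_MASK) | flags


# A page table is 512 such entries: plain memory, modelled here as a list.
page_table = [0] * 512

# "Mapping a page" is nothing more than an ordinary store into that memory.
page_table[3] = make_pte(0x1234_5000, PTE_PRESENT | PTE_WRITABLE)

# Clearing the writable bit is an ordinary read-modify-write (an "and").
page_table[3] &= ~PTE_WRITABLE
```

Nothing in this sequence is privileged in itself, which is why a VMM that wants to see these updates has to arrange for the stores to fault.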
The first page-table is the one actually written by the OS. The VMM sets this up as read-only memory, so whenever there is a write to it, the processor page-faults. Since KVM uses hardware virtualization extensions in the processor (SVM on AMD processors or VMX on Intel processors), the page-fault is captured by the VMM (KVM in this case), where the write operation is inspected to see if it's a "page-table write". If so, it is translated and applied to the second, shadow page-table - this is how the VMM makes the VM believe that memory starts at 0 and goes to 1GB, when in reality we've taken a bunch of pages from all over the place and put together 1GB of memory that appears to be a flat, consecutive set of pages. Of course, since the VMM is "lying" to the OS inside the VM, we can't let the OS write its REAL page-tables, since it wouldn't know the "true" page-table value to write there. [But we do also need to let the OS have its own page-tables, in case it were to read from the page-table and be utterly confused when the contents aren't what it expects].
The processor's "real CR3" is set by the VMM, and points at the shadow page-table.
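As a rough sketch of that trap-and-mirror idea, here is a toy model in Python. It assumes a single-level page table and a made-up guest-frame to host-frame layout; ToyVMM and its method names are invented for illustration and do not correspond to anything in KVM.

```python
class ToyVMM:
    """Toy shadow-paging model with a single-level page table."""

    def __init__(self, guest_to_host):
        # guest frame -> host frame (hypothetical, scattered layout)
        self.guest_to_host = guest_to_host
        self.guest_pt = {}   # what the guest OS sees: virtual page -> guest frame
        self.shadow_pt = {}  # what the real MMU uses: virtual page -> host frame
        self.traps = 0       # number of write-protection faults taken

    def guest_write_pte(self, vpage, guest_frame):
        """The guest stores to its write-protected page table, which faults into the VMM."""
        self.traps += 1
        # Emulate the store the guest wanted, so later reads look right.
        self.guest_pt[vpage] = guest_frame
        # Mirror the update into the shadow table, translated to a host frame.
        self.shadow_pt[vpage] = self.guest_to_host[guest_frame]

    def guest_read_pte(self, vpage):
        """Reads are not trapped; the guest sees its own values."""
        return self.guest_pt[vpage]


vmm = ToyVMM(guest_to_host={0: 907, 1: 42, 2: 5113})
vmm.guest_write_pte(vpage=0x10, guest_frame=1)
```

After the write, the guest's own table says "frame 1" while the shadow table, the one the hardware would actually walk, says "frame 42". A real VMM also has to handle multi-level tables, permission bits and invalidation, all of which this sketch leaves out.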
The VMM will trap on CR3 (page-table base-address) writes, so that it can track where page-tables live (and keep track of which "real CR3" to use). However, the VMM doesn't need to know about reads of CR3, so they are usually allowed to happen directly in the VM without being intercepted.
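The bookkeeping behind the CR3 write intercept can be sketched like this, again as a toy Python model. Cr3Tracker, on_guest_cr3_write and the counter-based "allocator" are all hypothetical names and simplifications; a real VMM would allocate and populate actual shadow tables.

```python
import itertools


class Cr3Tracker:
    """Toy model of a VMM intercepting guest writes to CR3."""

    def __init__(self):
        self.shadow_roots = {}  # guest CR3 value -> shadow root (host address)
        # Hypothetical allocator handing out page-aligned host addresses.
        self._next_root = itertools.count(0x100000, 0x1000)
        self.real_cr3 = None    # what the hardware register would hold

    def on_guest_cr3_write(self, guest_cr3):
        """Called on each intercepted CR3 write, e.g. a guest context switch."""
        if guest_cr3 not in self.shadow_roots:
            # First time this guest page table is seen: give it a shadow.
            self.shadow_roots[guest_cr3] = next(self._next_root)
        self.real_cr3 = self.shadow_roots[guest_cr3]
        return self.real_cr3


tracker = Cr3Tracker()
first = tracker.on_guest_cr3_write(0x5000)   # guest process A
second = tracker.on_guest_cr3_write(0x9000)  # guest process B
again = tracker.on_guest_cr3_write(0x5000)   # back to A: same shadow is reused
```

The guest only ever deals in its own CR3 values (0x5000, 0x9000); the value loaded into the hardware is always one chosen by the VMM.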
The whole point of the VMM extensions in the processors is to support this sort of intercepting of special instructions, while still running most of the privileged instructions in the VM as "regular" instructions - you wouldn't, for example, want to jump into the VMM for every write to the flags register to enable/disable interrupts, etc - let that happen in the VM as if it were a real piece of hardware. But for some registers it is critical that the VMM stays in control.
Obviously, when there is hardware support for the page-tables, there are two layers of page-tables: one that translates the "0-1GB" into "scattered all over the place", and the other being the actual page-table that the OS maintains. In this case, there is no need to intercept any of the page-table writes, page-faults or CR3 updates - the OS can do what it likes within its allowed sections of memory, as mapped by the underlying page-tables, and if the VM walks outside the allowed section, the VMM will catch that as a "VMM page-table fault". That of course makes the whole thing quite a bit more efficient.
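The two layers can be pictured as two lookups chained together. This is a simplified Python sketch with single-level tables and invented names (translate, guest_virtual_to_host_physical); real hardware walks multi-level tables and also translates the guest's page-table addresses through the second layer.

```python
PAGE = 4096


def translate(table, addr):
    """Look up addr in a single-level table mapping page number -> frame number."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise LookupError(f"fault at {addr:#x}")
    return table[page] * PAGE + offset


def guest_virtual_to_host_physical(guest_pt, nested_pt, gva):
    # Layer 1: the guest OS's own page table, written without any trapping.
    gpa = translate(guest_pt, gva)
    # Layer 2: the VMM's table, mapping "0-1GB" onto scattered host frames.
    return translate(nested_pt, gpa)


guest_pt = {0x10: 1}          # guest virtual page 0x10 -> guest frame 1
nested_pt = {0: 907, 1: 42}   # guest frame -> host frame (made-up values)
hpa = guest_virtual_to_host_physical(guest_pt, nested_pt, 0x10 * PAGE + 0x234)
```

If the guest maps a virtual page to a guest frame that is missing from nested_pt, the second lookup raises, which plays the role of the "VMM page-table fault" described above.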
I hope this makes sense.