Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

linux kernel page table update

In linux x86 paging.

  1. each process has it's own page directory.

  2. page table walking starts with page directory which is pointed by CR3.

  3. every process shares the kernel page directory content

assuming three sentences are correct, let's say some process enters kernel mode and updates his kernel page directory content(address mapping, access rights, etc...)

Question. since kernel address spaces is globally shared among processes, this update has to be synchronized with other process's page directory, right?

how can this be managed?

like image 479
daehee Avatar asked Sep 29 '22 13:09

daehee


2 Answers

I don't know about Linux, so I'll answer for Windows. Some of the kernel space is 'global', which is a flag set in the PTE to indicate it is used by more than one process. The INVPCID instruction can be configured in the register operand to include or exclude these entries in a TLB invalidate. These page table entries are shared between the processes and all appear at the same place in the page table for each process. This way, only the single PTE needs to be updated and it doesn't need to synchronise other PTEs of other processes as they all share a single PTE at a physical address.

http://www.cs.miami.edu/home/burt/journal/NT/memory.html

Some kernel memory is not visible to all processes and is private to each process (doesn't change the fact it is still ring 0). This, on a 32 bit Windows system would be 0xC0000000–0xC0200000 which contains all the user space PTEs and PDEs where 0xC0000000 is the PTE_BASE which allows for the equation

#define MiGetPteAddress (x)    ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + (ULONG_PTR)MmPteBase))
#define MiAddressToPte(x) MiGetPteAddress(x)

to work elegantly for converting faulting virtual address in cr2 to the address of the PTE. This is private to each process as each process has the same base PTE allocation base address; if it were visible to all processes it would quickly take up virtual memory as each set of page takes would have to be allocated sequentially. It doesn't need to be visible to all processes because a process has no interest in the page table entries of another process. A page fault is always handled in the context of the current process, and 0xC0000000–0xC0200000 means something different in each process context.

The kernel space 0xC0200000–0xC0400000 for allocation of kernel PTEs (for kernel addresses) would however be global and shared by all processes, except for the section within it representing 0xC0000000–0xC0200000, which by my calculation will be 0xC0300000–0xC0300800, which is the user-mode side of the PDEs as PDE_BASE = 0xC0300000–0xC0300FFF.

It is however impossible to split up the user PDE and kernel PDE section such that the former is private and the latter is global (i.e. make 0xC0300000–0xC0300800 private (point to different physical addresses) and 0xC0300000–0xC0300FFF point to the same physical address for each process) because the whole PDE region (0xC0300000–0xC0300FFF) will lie on the same physical frame and constitutes a single frame pointed to by cr3, and the cr3 is different for each process, which means that the whole PDE region (all PDEs) would have to private per process (duplicated and installed per process). If a kernel page table page (a page containing a kernel page table) were paged out and in to a new physical location then the PDEs would all have to be synchronised because all processes have copies at different cr3 physical addresses and not the same physical PDE. I'm not sure how it does this (efficiently) ATM therefore it would be wise to impose the restriction of not allowing the kernel page tables to be paged out and have them in non-paged pool; this way the kernel PDEs will remain constant across all CR3 pages. On 64 bit, the restriction could be imposed that kernel PDPTs can't be paged out. On 32 bit Windows, a process is started with a physical CR3 page with a PDE at offset 1100000000(base 2)*4 bytes pointing to itself which is hardwritten in, probably by briefly turning off paging in cr0 (because the write won't succeed without the recursive entry that needs to be written being there, creating a paradox). Notice, the PD Entry for itself is the page table that covers the range 0xC0000000–0xC0400000 i.e. it points to 1023 page tables and 1 page directory (itself) (2^10 entries) and hence allows the PTEs to be modified by their virtual address. The reason why the CR3 page is at 0xC0300000 is because the address has the same page directory and page table indexes 1100000000 and 1100000000 so it loops back on itself twice, therefore yielding the CR3 page and you can modify the PDEs by address (there are other addresses that are special like this e.g. 0xE0380000). After it is set up, the appropriate kernel mappings are made. On 64 bit Windows it would be similar where a process is set up with a single PML4 table page which points to itself and this way any PML4E, PDPTE, PDE or PTE can be filled in and accessed due to the variable amount of loopbacks. On 64 bit Windows, when a process is terminated, all the physical pages of the process get moved to the free list which would include all user physical PDPT pages, PD pages, PT pages and the PML4/CR3 page. The kernel ones would not be marked for the free list.

In general, if you know what entry in the PML4 is the recursive entry to the physical PML4 page you, can work out the virtual address of the PTE structure that services (is used to translate) a particular virtual address range and a particular virtual address in that range. You append the offset (10 bits for 32 bit; 9 bits for 64 bit) in the PML4 to the entry to itself, to the start of a virtual address whose servicing PTE virtual address you want to find (which is what the addition of 0xC0000000 is in the 32 bit equation earlier) and remove the last 12 bits and then make up the offset in the PT now at the end of the virtual address to 12 bits by multiplying it by 8 (or 4) (hence the right shift by 12 and the left shift by 3 (or 2 for 32 bit entries)). 1 loopback takes away 1 layer of indirection and you get the virtual address of the PTE. 2 loopbacks will leave you with the virtual address of the PDE that's used to translate that particular virtual address and so on. PTE_BASE on 32 bit windows is the offset 110000000 left shifted to make 32 bits and PDE_BASE is the offset 110000000110000000 left shifted to make 32 bits. It is used in the macro and any virtual address with this prefix will by definition be part of a PTE or a PDE respectively. Windows chooses the offset 1100000000 for the page table hierarchy but it could be any one of the 2^9 combinations.

KAISER, or KPTI, designed to mitigate meltdown, most likely has 2 cr3s for each process. Upon trapping to the kernel, the restricted cr3 for user mode which would contain a single kernel PML4E—enough for a preliminary interrupt dispatch routine function to be accessible, which performs the swap—would be replaced with the full cr3 containing all kernel PML4Es.

As for physical memory on windows, see here: https://superuser.com/a/1549970/933117

like image 198
Lewis Kelsey Avatar answered Oct 18 '22 11:10

Lewis Kelsey


Question. since kernel address spaces is globally shared among processes, this update has to be synchronized with other process's page directory, right?

how can this be managed?

First; understand that paging is usually 2 or more levels of tables. For example (for 80x86), for the oldest "plain 32-bit paging" there are page tables and page directories; and for current long mode there's page map level 4, page directory pointer table, page directory and page table. CR3 points to the highest level table and that must be different for each virtual address space ("process"). For the second highest level table, a single second highest level table can be put into all highest level tables, and if you do that any changes to the second highest level table will automatically change every virtual address space.

This means that (for 80x86), for the oldest "plain 32-bit paging" you can put the same "kernel page table" into all virtual address spaces (all page directories) and when you add/remove pages from that page table it will automatically affect all virtual address spaces; and for current long mode you can put the same page directory pointer table in all virtual address spaces (all page map level 4 tables) and when you add/remove page directories, page tables, or pages, it will automatically affect all virtual address spaces.

This means that you only really need some way to change second highest level page tables (or, some way to change all highest level page table entries). There are multiple ways to do this. The easiest is pre-allocation. For example, if you say "kernel space will always be N MiB" you can pre-allocate all the second highest level tables you'd need for "N MiB" during boot and never change them (e.g. for long mode, you could say that kernel space will be 512 GiB, pre-allocate a single "kernel page directory pointer table", and put that into every page map level 4 when a virtual address space is being created, and then rely on all other changes (to page directories, page tables, etc) automatically affecting the kernel space for all virtual address spaces). I believe this is the method Linux uses (partly because Linux uses the silly "map all RAM into kernel space" security disaster at boot).

However, this is just the table changes alone. There are 2 other concerns.

The first "other concern" is the CPU's translation look-aside buffers (TLBs); which need to be flushed when (virtual address to physical address) translation/s change. Most operating systems use a combination of "lazy TLB shootdown" (where a CPU using wrong information from TLB causes a page fault and page fault handler invalidates and returns so the software that caused the page fault can continue with the new/correct translation without knowing anything happened) and "multi-CPU TLB shootdown" (where you send an inter-processor interrupt to other CPUs and that interrupt handler invalidates the TLB entries).

The second "other concern" is making sure CPUs don't try to change the same thing at the same time. This typically ends up being a problem solved at a higher level. For example, if you acquire a lock for a certain data structure (before changing something in that data structure) and realize you need to allocate/free pages for that data structure (while you're trying to make the changes); then the code that modifies paging tables doesn't need to care about different CPUs changing the page tables at the same time because it knows that something at a higher level (the data structure's lock) already ensures that can't happen.

like image 44
Brendan Avatar answered Oct 18 '22 13:10

Brendan