
Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

Tags:

linux-kernel

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.

Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research and, across a few sources, uncovered the following reasons:

1) it makes better use of the cache and achieves lower average memory-access times (GigaQuantum: I don’t understand: how?)

2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)

3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)

Per the comments I inserted, I don't really understand these three reasons, and none of my research sources adequately explained or justified them. Can anyone explain them in a little more detail?

Thanks! This will help me better understand the kernel...

asked Nov 14 '11 by GigaQuantum



2 Answers

The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.

The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page-table entries and fewer TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).

Allocations that are only virtually contiguous can be requested using the vmalloc() interface rather than kmalloc().
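
To make the contrast concrete, here is a minimal kernel-module sketch (mine, not part of the original answer; the 64 KiB size is arbitrary) that requests both kinds of allocation:

```c
#include <linux/module.h>
#include <linux/slab.h>      /* kmalloc()/kfree() */
#include <linux/vmalloc.h>   /* vmalloc()/vfree() */

static void *kbuf, *vbuf;

static int __init alloc_demo_init(void)
{
	/* Physically contiguous: carved out of the kernel's linear
	 * mapping, so no page tables are touched at allocation time. */
	kbuf = kmalloc(64 * 1024, GFP_KERNEL);
	if (!kbuf)
		return -ENOMEM;

	/* Only virtually contiguous: scattered physical pages are
	 * stitched together into a fresh virtual range, which does
	 * require editing the kernel page tables. */
	vbuf = vmalloc(64 * 1024);
	if (!vbuf) {
		kfree(kbuf);
		return -ENOMEM;
	}

	pr_info("kmalloc buffer %p, vmalloc buffer %p\n", kbuf, vbuf);
	return 0;
}

static void __exit alloc_demo_exit(void)
{
	vfree(vbuf);
	kfree(kbuf);
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");
```

This is also why a large kmalloc() can fail on a fragmented system while a vmalloc() of the same size still succeeds: the latter doesn't need a contiguous physical block.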

On 64-bit systems the kernel's mapping can encompass the entirety of physical memory; on 32-bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
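
As a side note, pages outside that direct mapping (ZONE_HIGHMEM on 32-bit) have to be mapped temporarily before the kernel can touch their contents. A rough sketch, assuming the kmap_local_page() interface (older kernels used kmap_atomic() for the same purpose):

```c
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

/* On a 32-bit kernel, a GFP_HIGHUSER page may have no permanent
 * kernel mapping, so a short-lived one is created around the access.
 * On 64-bit this simply resolves to the page's linear-map address. */
static void zero_one_highmem_page(void)
{
	struct page *page = alloc_page(GFP_HIGHUSER);
	void *addr;

	if (!page)
		return;

	addr = kmap_local_page(page);   /* set up a temporary mapping */
	memset(addr, 0, PAGE_SIZE);
	kunmap_local(addr);             /* and tear it down again */

	__free_page(page);
}
```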

answered Oct 11 '22 by caf


Actually the behavior of memory allocation you describe is common to many OS kernels, and the main reason is the kernel's physical page allocator. Typically, the kernel has one physical page allocator that is used to allocate pages for both kernel space (including pages for DMA) and user space. In kernel space you need contiguous memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, mapping pages on demand would be pointless, because the kernel can see the whole address space (on 32-bit systems the virtual address space is limited to 4G, so typically the top 1G is dedicated to the kernel and the bottom 3G to user space).

The Linux kernel uses the buddy algorithm for page allocation, so that allocating a bigger chunk takes fewer iterations than allocating many smaller chunks (smaller chunks are obtained by splitting bigger ones). Moreover, using one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocated pages for user space one page per iteration: if user space needs N pages, you make N iterations. What happens when the kernel then wants some contiguous memory? How could it build a big enough contiguous chunk if single pages have been stolen from each big chunk and handed to user space?
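
As a concrete illustration of those power-of-two splits: the buddy allocator deals in "orders", where order 0 is one page, order 1 is two pages, order 3 is eight pages, and so on. A small sketch (the helper names here are mine) using the in-kernel API:

```c
#include <linux/gfp.h>
#include <linux/mm.h>   /* get_order() */

/* One order-3 request returns 8 physically contiguous pages in a
 * single step, split down from a larger free block if necessary. */
static struct page *grab_eight_pages(void)
{
	unsigned int order = get_order(8 * PAGE_SIZE);   /* == 3 */

	return alloc_pages(GFP_KERNEL, order);
}

/* On free, the allocator may merge the block with its "buddy"
 * back into a larger free block, which counters fragmentation. */
static void release_eight_pages(struct page *page)
{
	__free_pages(page, get_order(8 * PAGE_SIZE));
}
```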

[update]

Actually, the kernel allocates contiguous blocks of memory for user space less frequently than you might think. Sure, it allocates them when it builds the ELF image of a binary, when it performs readahead as a user process reads a file, and it creates them for IPC operations (pipes, socket buffers) or when the user passes the MAP_POPULATE flag to the mmap syscall. But typically the kernel uses a "lazy" page-loading scheme: it gives a contiguous range of virtual memory to user space (when the user first calls malloc, or calls mmap), but it doesn't fill that range with physical pages; it allocates a page only when a page fault occurs. The same is true when a user process forks: the child's address space is mapped "read-only", and when the child modifies some data, a page fault occurs and the kernel replaces the page in the child's address space with a new one (so that parent and child now have different pages). Typically the kernel allocates only one page in these cases.
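
You can watch the lazy scheme from user space. This standalone demo program (mine, for illustration) maps 64 MiB anonymously and shows that the resident set only grows once the pages are actually touched:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Resident set size in pages: second field of /proc/self/statm. */
static long resident_pages(void)
{
	long size = 0, resident = -1;
	FILE *f = fopen("/proc/self/statm", "r");

	if (f) {
		if (fscanf(f, "%ld %ld", &size, &resident) != 2)
			resident = -1;
		fclose(f);
	}
	return resident;
}

int main(void)
{
	size_t len = 64 * 1024 * 1024;   /* 64 MiB of virtual space */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* No physical pages are backing the region yet... */
	printf("after mmap:   %ld resident pages\n", resident_pages());

	/* ...until we fault them in by writing to them. */
	memset(p, 1, len);
	printf("after memset: %ld resident pages\n", resident_pages());

	munmap(p, len);
	return 0;
}
```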

Of course there's a big question of memory fragmentation here. Kernel space always needs contiguous memory: if the kernel allocated pages for user space from "random" physical locations, it would be much harder to obtain a big chunk of contiguous memory in the kernel after some time (for example, after a week of system uptime). Memory would be too fragmented by then.

To mitigate this problem the kernel uses a "readahead" scheme for page faults as well: when a page fault occurs in the address space of some process, the kernel allocates and maps more than one page (because there's a good chance the process will read/write data from the following pages), and of course it uses a physically contiguous block of memory for this where possible, just to reduce potential fragmentation.
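
You can observe this behaviour (called "fault-around" in today's kernels) with mincore(), which reports which pages of a mapping are resident. In this demo (mine; the exact counts vary with kernel version and tuning), touching a single byte of a file-backed mapping typically makes several neighbouring pages resident:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define NPAGES 32

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	size_t len = NPAGES * psize;
	unsigned char vec[NPAGES];
	char path[] = "/tmp/faultaround-XXXXXX";
	int fd = mkstemp(path);
	char *p;
	int i, resident = 0;

	if (fd < 0 || ftruncate(fd, len) != 0) {
		perror("tempfile");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	volatile char c = p[0];   /* fault in one single byte */
	(void)c;

	/* Ask the kernel which of the 32 pages are now resident. */
	if (mincore(p, len, vec) == 0) {
		for (i = 0; i < NPAGES; i++)
			resident += vec[i] & 1;
		printf("touched 1 byte, %d of %d pages resident\n",
		       resident, NPAGES);
	}

	munmap(p, len);
	close(fd);
	unlink(path);
	return 0;
}
```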

answered Oct 11 '22 by Dan Kruchinin