What makes a TLB faster than a Page Table if they both require two memory accesses?

Just going off wikipedia:

The page table, generally stored in main memory, keeps track of where the virtual pages are stored in the physical memory. This method uses two memory accesses (one for the page table entry, one for the byte) to access a byte. First, the page table is looked up for the frame number. Second, the frame number with the page offset gives the actual address. Thus any straightforward virtual memory scheme would have the effect of doubling the memory access time. Hence, the TLB is used to reduce the time taken to access the memory locations in the page table method.

So given that, what I'm curious about is why the TLB is actually faster because from what I know it's just a smaller, exact copy of the page table.

You still need to access the TLB to find the physical address, and then once you have that, you still need to actually access the data at the physical address, which is two lookups just like with the page table.

I can only think of two reasons why the TLB is faster:

  • looking up an address in the TLB or page table is not O(n) (I assumed it's O(1) like a hash table). Thus, since the TLB is much smaller, it's faster to do a lookup. Also in this case, why not just use a hash table instead of a TLB?

  • I incorrectly interpreted how the TLB works, and it's not actually doing two accesses.

m0meni asked Dec 06 '22 16:12

2 Answers

I realize it has been three years since this question was asked, but since it is still just as relevant, and it still shows up in search engines, I'll try my best to produce a complete answer.

Accessing the main memory through the TLB rather than the page table is faster primarily for two reasons:

1. The TLB is faster than main memory (which is where the page table resides).

The typical access time is on the order of <1 ns for the TLB versus ~100 ns for main memory. A TLB lookup is part of an L1 cache hit, and modern CPUs can do two loads per clock if both hit in the L1d cache.

The reasons for this are twofold:

  1. The TLB is located within the CPU, while main memory - and thus the page table - is not.
  2. The TLB - like other caches - is made of fast but expensive SRAM, whereas main memory usually consists of slow but inexpensive DRAM.

Thus, even if the supposition that both the TLB and the page table require only one memory access were correct, a TLB hit would still, roughly speaking, halve memory access time. However, as we shall see next, the supposition is not correct, and the benefit of having a TLB is even greater.
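The rough arithmetic behind "a TLB hit halves access time" can be sketched with the figures above (~1 ns TLB, ~100 ns DRAM). The numbers and the hit ratio here are illustrative assumptions, not measurements of any specific CPU:

```python
T_TLB = 1    # ns, one TLB lookup (illustrative)
T_MEM = 100  # ns, one main-memory access (illustrative)

def effective_access_time(hit_ratio, levels=1):
    """Average time to access one byte of data.

    On a TLB hit: one TLB lookup + one memory access for the data.
    On a miss: the TLB lookup, a page-table walk of `levels`
    memory accesses, then the data access.
    """
    hit = T_TLB + T_MEM
    miss = T_TLB + levels * T_MEM + T_MEM
    return hit_ratio * hit + (1 - hit_ratio) * miss

print(effective_access_time(0.99, levels=1))  # ~102 ns, single-level table
print(effective_access_time(0.99, levels=4))  # ~105 ns, 4-level walk
print(effective_access_time(0.0, levels=1))   # ~201 ns: no TLB, time doubles
```

With a high hit ratio the average stays close to a single memory access, which is the whole point; with no TLB at all, the single-level scheme really does double the access time, as the Wikipedia quote in the question says.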

2. Accessing the page table usually requires multiple memory accesses.

This really is the crux of the issue.

Modern CPUs tend to use multilevel page tables in order to save memory. Most notably, x86-64 page tables currently consist of up to four levels (and a fifth may be coming). This means that accessing a single byte in memory through the page table requires up to five memory accesses: four for the page table and one for the data. Obviously the cost would be unbearably high if not for the TLB; it is easy to see why CPU and OS engineers put in a lot of effort to minimize the frequency of TLB misses.
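To make the "up to four levels" concrete, here is a sketch of how a 48-bit x86-64 virtual address splits into the four 9-bit table indices (PML4, PDPT, PD, PT) plus a 12-bit page offset; each index selects one entry at one level of the walk:

```python
def split_x86_64_va(va):
    """Decompose a 48-bit x86-64 virtual address (4 KiB pages)."""
    offset = va & 0xFFF          # bits 0-11: offset within the 4 KiB page
    pt     = (va >> 12) & 0x1FF  # bits 12-20: page-table index
    pd     = (va >> 21) & 0x1FF  # bits 21-29: page-directory index
    pdpt   = (va >> 30) & 0x1FF  # bits 30-38: page-directory-pointer index
    pml4   = (va >> 39) & 0x1FF  # bits 39-47: PML4 index
    return pml4, pdpt, pd, pt, offset

# A page walk reads one entry per level -- four dependent memory
# accesses -- before the data byte itself can be fetched.
print(split_x86_64_va(0x1000))  # (0, 0, 0, 1, 0): second page, PT level
```

The four table reads are serially dependent (each entry holds the physical address of the next level's table), which is why a TLB miss is so costly.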

Finally, do note that even this explanation is somewhat of a simplification, as it ignores, among other things, data caching. The detailed mechanics of modern desktop CPUs are complex and, to a degree, undisclosed. For a more detailed discussion on the topic, refer to this thread, for instance.

Page-table accesses can be, and are, cached by the data caches on modern CPUs. But each access in a page walk depends on the result of the previous one (a pointer to the next level of the table), so a 4-level walk has about 4 × 4 cycles = 16 cycles of latency even if every access hits in L1d cache. That is far more for the pipeline to hide than the ~3 to 4 cycle TLB latency that is part of an L1d-hit load on a modern Intel CPU (which of course uses TLBs for both data and instruction accesses).

ihonen answered Dec 31 '22 13:12


You are right in your assumption that the TLB approach still requires two accesses. But it is faster because:

The TLB is made of a faster kind of memory called associative memory.

Without a TLB we make two accesses to physical memory, but with a TLB one access goes to the TLB and only the other goes to physical memory.

Associative memory is faster because it is content-addressable: a lookup compares the key against all entries in parallel. It is also more expensive, because of the extra logic circuits required.
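The two paths described above can be sketched as follows, with a Python dict standing in for the hardware's parallel CAM lookup (the frame numbers and single-level page table are made-up examples):

```python
PAGE_SIZE = 4096

tlb = {}                           # virtual page number -> frame (the "CAM")
page_table = {0: 7, 1: 3, 2: 9}    # hypothetical single-level page table in RAM

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                 # hit: 1 fast TLB probe + 1 memory access
        frame = tlb[vpn]
    else:                          # miss: extra access(es) to the table in RAM
        frame = page_table[vpn]
        tlb[vpn] = frame           # cache the translation for next time
    return frame * PAGE_SIZE + offset

print(translate(4096 + 42))  # vpn 1 -> frame 3, so 3 * 4096 + 42
```

In real hardware the hit path costs almost nothing because the TLB probe happens inside the CPU in parallel, whereas the miss path goes out to main memory.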

You can read more about content-addressable memory here.

Sumeet answered Dec 31 '22 13:12