CPU cache: does the distance between two addresses need to be smaller than 8 bytes to have a cache advantage?

It may seem like a weird question.

Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post said it's the case for Intel Core i7).

There are two objects A, B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache line boundary, that is, its address is an integer multiple of 64.

1) If N < 64, when A is fetched by the CPU, B will be read into the cache too. So if B is needed and the cache line has not been evicted yet, the CPU fetches B in a very short time. Everybody is happy.

2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache line along with A. So we say "the CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated node-based data structures, like std::list.

My question is, if N > 64 but is still small, say N = 70 (in other words, A and B do not fit in one cache line but are not too far apart), when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?

Rephrased: when A is loaded, let t represent the elapsed time of fetching B. Is t(N=70) much smaller than, or almost equal to, t(N=9999999)?

I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since the CPU cache is hierarchical.

It would be even better if there were quantitative research on this.
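(For anyone who wants to measure this themselves: below is a minimal, untested sketch, assuming x86-64 Linux, C++17, and POSIX aligned_alloc. It builds a pointer chain whose consecutive nodes are N bytes apart in a buffer much larger than L3, then times the dependent dereferences. Pinning to one core and averaging several runs would be needed for stable numbers; very large strides get few samples per pass, so those results are noisier.)

```cpp
// Sketch: average latency of a dependent load whose target is
// `stride` bytes after the previous one.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    constexpr size_t kBufBytes = size_t{1} << 28;              // 256 MiB, well beyond L3
    char* buf = static_cast<char*>(aligned_alloc(64, kBufBytes + 64));
    if (!buf) return 1;
    for (size_t stride : {8, 64, 70, 128, 4096, 9999999}) {
        // One pass over the buffer at most; big strides yield few hops.
        size_t hops = std::min<size_t>(size_t{1} << 20, kBufBytes / stride);
        for (size_t i = 0; i < hops; ++i) {
            size_t cur = (i * stride) % kBufBytes;
            size_t nxt = ((i + 1) * stride) % kBufBytes;
            *reinterpret_cast<char**>(buf + cur) = buf + nxt;  // link node i -> i+1
        }
        char* p = buf;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < hops; ++i)
            p = *reinterpret_cast<char**>(p);                  // dependent load chain
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
        printf("N = %8zu: %6.1f ns per fetch (last=%p)\n", stride, ns, (void*)p);
    }
    free(buf);
}
```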

asked Aug 16 '17 by Leedehai


2 Answers

There are at least three factors which can make a fetch of B after a miss on A faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit-stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest-sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
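A rough way to probe for such next-line prefetching (a sketch, not a definitive test: it assumes x86-64 with GCC/Clang intrinsics, and that a crude busy-wait gives the prefetcher time to act) is to flush the lines, demand-load A, then compare the timed latency of a load 70 bytes after A against one far away:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

// Fenced rdtsc around a single load; repetition and core pinning
// would be needed for stable numbers.
static uint64_t timed_load(const volatile char* p) {
    _mm_mfence();
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    (void)*p;                        // the load being timed
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}

int main() {
    alignas(128) static char buf[1 << 16];
    volatile char* A = buf;          // start of a 128-byte-aligned pair
    volatile char* B = buf + 70;     // same pair, next 64-byte line
    volatile char* C = buf + 8192;   // far away, no reason to be prefetched
    _mm_clflush(buf);
    _mm_clflush(buf + 70);
    _mm_clflush(buf + 8192);
    _mm_mfence();
    (void)*A;                        // demand miss that may trigger a prefetch of B's line
    for (volatile int i = 0; i < 1000; ++i) {}   // crude delay for the prefetch
    // If a next-line/spatial prefetcher pulled in B's line, B should be
    // noticeably cheaper than C.
    printf("B: %llu cycles, C: %llu cycles\n",
           (unsigned long long)timed_load(B),
           (unsigned long long)timed_load(C));
}
```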

Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
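As a back-of-the-envelope illustration (a sketch only: 8 KiB per row is an assumed figure, and real memory controllers interleave consecutive physical addresses across channels and banks):

```cpp
#include <cstdint>
#include <cstdio>

// Assumed geometry: 8 KiB per DRAM row, no interleaving.
constexpr uint64_t kRowBytes = 8 * 1024;

bool same_dram_row(uint64_t physA, uint64_t physB) {
    return physA / kRowBytes == physB / kRowBytes;  // same row index
}

int main() {
    uint64_t a = 0x10000;
    printf("N=70:      same row? %d\n", same_dram_row(a, a + 70));       // likely 1
    printf("N=9999999: same row? %d\n", same_dram_row(a, a + 9999999));  // 0
}
```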

Third, if B is in the same address translation page as A, a TLB miss may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
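For concreteness, a small sketch of these address computations for x86-64 4 KiB paging (8-byte PTEs, so one 64-byte line of a page table covers 8 consecutive pages; the example address is arbitrary):

```cpp
#include <cstdint>
#include <cstdio>

bool same_4k_page(uint64_t a, uint64_t b)    { return (a >> 12) == (b >> 12); }
bool same_2m_region(uint64_t a, uint64_t b)  { return (a >> 21) == (b >> 21); }
// 12 page-offset bits + log2(8 PTEs per 64-byte line) = 15
bool ptes_share_line(uint64_t a, uint64_t b) { return (a >> 15) == (b >> 15); }

int main() {
    uint64_t a = 0x400000;  // arbitrary example address
    printf("N=70:      same page %d, same 2MiB %d, PTEs share line %d\n",
           same_4k_page(a, a + 70), same_2m_region(a, a + 70),
           ptes_share_line(a, a + 70));
    printf("N=9999999: same page %d, same 2MiB %d, PTEs share line %d\n",
           same_4k_page(a, a + 9999999), same_2m_region(a, a + 9999999),
           ptes_share_line(a, a + 9999999));
}
```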

In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together at a fixed, ordered stride. This would seem to be a rather difficult optimization of limited applicability.

One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple indexing modulo a power of two with limited associativity, so power-of-two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
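A sketch with an assumed L1D geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) shows why a 4096-byte stride is pathological: every access lands in the same set, and the ninth such line exceeds the 8 ways, evicting one of the first eight:

```cpp
#include <cstdint>
#include <cstdio>

// Assumed geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets.
constexpr uint64_t kLine = 64, kSets = 64, kWays = 8;

uint64_t set_of(uint64_t addr) { return (addr / kLine) % kSets; }

int main() {
    // Stride 4096 = 64 lines, so the set index is identical every time.
    for (uint64_t i = 0; i < kWays + 1; ++i)
        printf("addr %#7llx -> set %llu\n",
               (unsigned long long)(i * 4096),
               (unsigned long long)set_of(i * 4096));
}
```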

(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
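A simple way to see that dependency cost (a minimal, untested C++17 sketch): the random cycle defeats prefetching, so each hop pays close to the full memory latency, while a linear scan over the same data lets misses overlap:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr size_t n = size_t{1} << 24;        // ~128 MiB of indices
    std::vector<size_t> next(n), order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});
    for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];               // one big random cycle

    auto t0 = std::chrono::steady_clock::now();
    size_t p = 0;                                // serialized: each load needs the last
    for (size_t i = 0; i < n; ++i) p = next[p];
    auto t1 = std::chrono::steady_clock::now();

    size_t sum = 0;                              // independent: misses can overlap
    for (size_t i = 0; i < n; ++i) sum += next[i];
    auto t2 = std::chrono::steady_clock::now();

    printf("chase %.0f ms vs scan %.0f ms (p=%zu sum=%zu)\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count(), p, sum);
}
```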

answered Sep 28 '22 by Paul A. Clayton


If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
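A one-liner makes the actual criterion explicit (a trivial sketch): two addresses share a line iff they fall in the same aligned 64-byte block, regardless of which one is lower:

```cpp
#include <cstdint>

// Same 64-byte line iff all bits above bit 5 agree.
bool same_line(uintptr_t a, uintptr_t b) { return (a >> 6) == (b >> 6); }
// e.g. same_line(x, x - 1) is false whenever x is a multiple of 64.
```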


Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.

From Intel's optimization manual, in section 2.3 SANDY BRIDGE:

2.3.5.4 Data Prefetching

  • ... Some prefetchers fetch into L1.

  • Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

  • ... several other prefetchers try to prefetch into L2

IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.

Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
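So, under the quoted behaviour, the relevant check is whether two addresses fall in the same naturally-aligned 128-byte pair (a small sketch; buddy_line is a hypothetical helper name, not anything Intel defines):

```cpp
#include <cstdint>

// Same naturally-aligned 128-byte pair iff all bits above bit 6 agree.
bool same_128_pair(uintptr_t a, uintptr_t b) { return (a >> 7) == (b >> 7); }

// The other 64-byte line of the pair: flip bit 6 of the line address.
uintptr_t buddy_line(uintptr_t a) { return (a & ~uintptr_t{63}) ^ 64; }
```

For the asker's N = 70 case: if A is 128-byte aligned, B at A+70 sits in the buddy line of A's pair, exactly what this prefetcher targets; if A is at an odd multiple of 64, B falls in the next pair and gets no help from it.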


See Paul's excellent answer for other factors.

answered Sep 28 '22 by Peter Cordes