CPU cache: does the distance between two addresses need to be smaller than 8 bytes to have a cache advantage?

It may seem like a weird question.

Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post said it's the case for Intel Core i7).

There are two objects A, B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache line boundary, that is, its address is an integer multiple of 64.

1) If N < 64, when A is fetched by the CPU, B will be read into the cache too. So if B is needed and the cache line has not been evicted yet, the CPU fetches B in a very short time. Everybody is happy.

2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache line along with A. So we say "the CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated node-based data structures, like std::list.

My question is, if N > 64 but is still small, say N = 70 (in other words, A and B do not fit in one cache line but are not too far apart), when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?

Rephrased: when A is loaded, let t represent the elapsed time of fetching B. Is t(N=70) much smaller than, or almost equal to, t(N=9999999)?

I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since the CPU cache is hierarchical.

It would be even better if there were quantitative research on this.
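(For anyone who wants to measure this themselves: below is a minimal, untested sketch, assuming x86-64 Linux, C++17, and POSIX aligned_alloc. It builds a pointer chain whose consecutive nodes are N bytes apart in a buffer much larger than L3, then times the dependent dereferences. Pinning to one core and averaging several runs would be needed for stable numbers; very large strides get few samples per pass, so those results are noisier.)

```cpp
// Sketch: average latency of a dependent load whose target is
// `stride` bytes after the previous one.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    constexpr size_t kBufBytes = size_t{1} << 28;              // 256 MiB, well beyond L3
    char* buf = static_cast<char*>(aligned_alloc(64, kBufBytes + 64));
    if (!buf) return 1;
    for (size_t stride : {8, 64, 70, 128, 4096, 9999999}) {
        // One pass over the buffer at most; big strides yield few hops.
        size_t hops = std::min<size_t>(size_t{1} << 20, kBufBytes / stride);
        for (size_t i = 0; i < hops; ++i) {
            size_t cur = (i * stride) % kBufBytes;
            size_t nxt = ((i + 1) * stride) % kBufBytes;
            *reinterpret_cast<char**>(buf + cur) = buf + nxt;  // link node i -> i+1
        }
        char* p = buf;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < hops; ++i)
            p = *reinterpret_cast<char**>(p);                  // dependent load chain
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
        printf("N = %8zu: %6.1f ns per fetch (last=%p)\n", stride, ns, (void*)p);
    }
    free(buf);
}
```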

asked Aug 16 '17 by Leedehai


2 Answers

There are at least three factors which can make a fetch of B after a miss on A faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit-stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest-sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
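A rough way to probe for such next-line prefetching (a sketch, not a definitive test: it assumes x86-64 with GCC/Clang intrinsics, and that a crude busy-wait gives the prefetcher time to act) is to flush the lines, demand-load A, then compare the timed latency of a load 70 bytes after A against one far away:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

// Fenced rdtsc around a single load; repetition and core pinning
// would be needed for stable numbers.
static uint64_t timed_load(const volatile char* p) {
    _mm_mfence();
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    (void)*p;                        // the load being timed
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}

int main() {
    alignas(128) static char buf[1 << 16];
    volatile char* A = buf;          // start of a 128-byte-aligned pair
    volatile char* B = buf + 70;     // same pair, next 64-byte line
    volatile char* C = buf + 8192;   // far away, no reason to be prefetched
    _mm_clflush(buf);
    _mm_clflush(buf + 70);
    _mm_clflush(buf + 8192);
    _mm_mfence();
    (void)*A;                        // demand miss that may trigger a prefetch of B's line
    for (volatile int i = 0; i < 1000; ++i) {}   // crude delay for the prefetch
    // If a next-line/spatial prefetcher pulled in B's line, B should be
    // noticeably cheaper than C.
    printf("B: %llu cycles, C: %llu cycles\n",
           (unsigned long long)timed_load(B),
           (unsigned long long)timed_load(C));
}
```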

Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
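As a back-of-the-envelope illustration (a sketch only: 8 KiB per row is an assumed figure, and real memory controllers interleave consecutive physical addresses across channels and banks):

```cpp
#include <cstdint>
#include <cstdio>

// Assumed geometry: 8 KiB per DRAM row, no interleaving.
constexpr uint64_t kRowBytes = 8 * 1024;

bool same_dram_row(uint64_t physA, uint64_t physB) {
    return physA / kRowBytes == physB / kRowBytes;  // same row index
}

int main() {
    uint64_t a = 0x10000;
    printf("N=70:      same row? %d\n", same_dram_row(a, a + 70));       // likely 1
    printf("N=9999999: same row? %d\n", same_dram_row(a, a + 9999999));  // 0
}
```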

Third, if B is in the same address translation page as A, a TLB miss may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
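For concreteness, a small sketch of these address computations for x86-64 4 KiB paging (8-byte PTEs, so one 64-byte line of a page table covers 8 consecutive pages; the example address is arbitrary):

```cpp
#include <cstdint>
#include <cstdio>

bool same_4k_page(uint64_t a, uint64_t b)    { return (a >> 12) == (b >> 12); }
bool same_2m_region(uint64_t a, uint64_t b)  { return (a >> 21) == (b >> 21); }
// 12 page-offset bits + log2(8 PTEs per 64-byte line) = 15
bool ptes_share_line(uint64_t a, uint64_t b) { return (a >> 15) == (b >> 15); }

int main() {
    uint64_t a = 0x400000;  // arbitrary example address
    printf("N=70:      same page %d, same 2MiB %d, PTEs share line %d\n",
           same_4k_page(a, a + 70), same_2m_region(a, a + 70),
           ptes_share_line(a, a + 70));
    printf("N=9999999: same page %d, same 2MiB %d, PTEs share line %d\n",
           same_4k_page(a, a + 9999999), same_2m_region(a, a + 9999999),
           ptes_share_line(a, a + 9999999));
}
```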

In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together at a fixed, ordered stride. This would seem to be a rather difficult optimization of limited applicability.

One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple indexing modulo a power of two with limited associativity, so power-of-two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
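A sketch with an assumed L1D geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) shows why a 4096-byte stride is pathological: every access lands in the same set, and the ninth such line exceeds the 8 ways, evicting one of the first eight:

```cpp
#include <cstdint>
#include <cstdio>

// Assumed geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets.
constexpr uint64_t kLine = 64, kSets = 64, kWays = 8;

uint64_t set_of(uint64_t addr) { return (addr / kLine) % kSets; }

int main() {
    // Stride 4096 = 64 lines, so the set index is identical every time.
    for (uint64_t i = 0; i < kWays + 1; ++i)
        printf("addr %#7llx -> set %llu\n",
               (unsigned long long)(i * 4096),
               (unsigned long long)set_of(i * 4096));
}
```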

(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
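A simple way to see that dependency cost (a minimal, untested C++17 sketch): the random cycle defeats prefetching, so each hop pays close to the full memory latency, while a linear scan over the same data lets misses overlap:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr size_t n = size_t{1} << 24;        // ~128 MiB of indices
    std::vector<size_t> next(n), order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});
    for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];               // one big random cycle

    auto t0 = std::chrono::steady_clock::now();
    size_t p = 0;                                // serialized: each load needs the last
    for (size_t i = 0; i < n; ++i) p = next[p];
    auto t1 = std::chrono::steady_clock::now();

    size_t sum = 0;                              // independent: misses can overlap
    for (size_t i = 0; i < n; ++i) sum += next[i];
    auto t2 = std::chrono::steady_clock::now();

    printf("chase %.0f ms vs scan %.0f ms (p=%zu sum=%zu)\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count(), p, sum);
}
```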

answered Sep 28 '22 by Paul A. Clayton


If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
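A one-liner makes the actual criterion explicit (a trivial sketch): two addresses share a line iff they fall in the same aligned 64-byte block, regardless of which one is lower:

```cpp
#include <cstdint>

// Same 64-byte line iff all bits above bit 5 agree.
bool same_line(uintptr_t a, uintptr_t b) { return (a >> 6) == (b >> 6); }
// e.g. same_line(x, x - 1) is false whenever x is a multiple of 64.
```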


Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.

From Intel's optimization manual, in section 2.3 SANDY BRIDGE:

2.3.5.4 Data Prefetching

  • ... Some prefetchers fetch into L1.

  • Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

  • ... several other prefetchers try to prefetch into L2

IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.

Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
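So, under the quoted behaviour, the relevant check is whether two addresses fall in the same naturally-aligned 128-byte pair (a small sketch; buddy_line is a hypothetical helper name, not anything Intel defines):

```cpp
#include <cstdint>

// Same naturally-aligned 128-byte pair iff all bits above bit 6 agree.
bool same_128_pair(uintptr_t a, uintptr_t b) { return (a >> 7) == (b >> 7); }

// The other 64-byte line of the pair: flip bit 6 of the line address.
uintptr_t buddy_line(uintptr_t a) { return (a & ~uintptr_t{63}) ^ 64; }
```

For the asker's N = 70 case: if A is 128-byte aligned, B at A+70 sits in the buddy line of A's pair, exactly what this prefetcher targets; if A is at an odd multiple of 64, B falls in the next pair and gets no help from it.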


See Paul's excellent answer for other factors.

answered Sep 28 '22 by Peter Cordes