
According to Intel my cache should be 24-way associative, though it's 12-way. How is that?

According to the “Intel 64 and IA-32 Architectures Optimization Reference Manual,” April 2012, page 2-23:

The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5M/1M/1.5M/2M block size. However, due to the address distribution among the cache blocks from the software point of view, this does not appear as a normal N-way cache.

My computer is a 2-core Sandy Bridge with a 3 MB, 12-way set associative LLC. That does not seem consistent with Intel's documentation, though: according to it, I should have 24 ways. I can imagine there is something going on with the number of cores/cache slices, but I can't quite figure it out. If I have 2 cores and hence 2 cache slices of 1.5 MB each, I would have 12 ways per cache slice according to Intel, and that does not seem consistent with my CPU specs. Can someone clarify this for me?

If I wanted to evict an entire cache line would I need to access the cache in strides of 128 KB or 256 KB? In fact this is what I am trying to achieve.

Any suggested readings are very welcome.

asked Jan 06 '23 by alex10791


2 Answers

Associativity is orthogonal to the number of slices and to the mapping done by the hash function. If a given address is mapped to some cache slice (and a given set within it), it can only compete over the ways with other lines that were mapped to the same place. Having 2 slices does not raise associativity, it only reduces contention (since lines are eventually distributed evenly over more sets).

Therefore you have 12 ways per slice, but the overall associativity per set is still 12 ways.

If you were to test your associativity by accessing different lines mapped to the same set, you would just have a harder time picking such lines (you'd need to know the hash function), but you're still going to get thrashing after 12 lines. However, if you were to ignore the hashing and assume lines are simply mapped by their set bits, it could appear as if you have higher associativity, simply because the lines would divide uniformly between the slices, so thrashing would take longer. This isn't real associativity, but it comes close for some practical purposes. It would only work if you're using a wide physical memory range, though, since the upper bits need to change for the hashing to have any impact.

answered Jan 21 '23 by Leeor


Having 2 slices doubles the number of sets, not the number of ways per set. The latter would require every slice to check its tags for a set, so bandwidth wouldn't scale with cores (where every core has a slice of L3).

The actual design means that the index determines a single stop on the ring bus which needs to handle a request for a single line.


If I wanted to evict an entire cache line would I need to access the cache in strides of 128 KB or 256 KB? In fact this is what I am trying to achieve.

Neither, it's not that simple. Unlike the smaller / faster caches, the index for the last-level cache isn't a simple range of bits from the address. It's more like a hash function of all the address bits above the offset into the cache line, which reduces collisions when large strides happen by accident, or when multiple programs (or instances of the same program) on the same system use the same offset relative to a hugepage or whatever other boundary.

The last-level cache indexing function is one of Intel's secret ingredients; AFAIK it hasn't been reverse-engineered or published, but I haven't gone looking.

Obviously you can use a large buffer to have a very high chance of having evicted a line before you come back to it, but I don't know if there's a good way otherwise. clflushopt costs about as much as a store, plus the work of making sure no copy of the cache line still exists in any cache.

prefetchnta prefetches into L1, and into L3 with fast eviction (using only a limited number of ways). In practice it can produce L3 misses even with a working set smaller than L3, without forced evictions; effectively you just get conflict misses.

answered Jan 21 '23 by Peter Cordes