How does cpu cache handle large memory objects?

Question

Scenario:

Cache (L1) size (CS): 32kB
Line size (LS): 64B
Associativity (A): 8
Set size (SS): 512B (A * LS)
Sets (S): 64 (CS / SS)
Read/written object (O) has size greater than LS

Assumptions (correct me if invalid):

Virtual memory blocks (of size 4kB (SS * A) denoted as B) are mapped in modulo-like manner to sets. In other words, addresses 0x0 : 0x0FFF (block index (BI) 0) are mapped to set 0, 0x1000 : 0x1FFF (BI 1) are mapped to 1, and so forth.
Request of reading/writing (no non-temporal writes/reads are used) a given address A requires finding its BI and then moving it to the assigned set. For instance, A = 0x4600A will have BI = 70. This BI is mapped to set 6 (BI % S).
In order to properly (without misalignment) r/w an object (O) to cache, an alignment of LS is required.

Questions:

Will the O be serially aligned in the cache or can it take (for instance) free slots 0 & 4 & 5, instead of 0 & 1 & 2?
What is the cost (penalty) of retrieving partitioned O from cache? Assume that the O isn't partitioned across several B.
The same question as above, but in case when O is placed in two B, thus two sets are used.
What will happen if the O size is larger than the SS (512B)? Will it move the data to L2 and step-by-step move data to L1? Will it use other sets?
What if L2 (and L3 for that matter) is too small for all the data?

chus · Accepted Answer

Virtual memory blocks (of size 4kB (SS * A) denoted as B) are mapped in modulo-like manner to sets. In other words, addresses 0x0 : 0x0FFF (block index (BI) 0) are mapped to set 0, 0x1000 : 0x1FFF (BI 1) are mapped to 1, and so forth.

Transfer between L1 cache and the memory hierarchy: the transfer unit between the L1 cache and the next level of the memory hierarchy is a block of line size (LS) bytes. That is, from the point of view of your L1 cache, memory is structured in 64-byte blocks (LS bytes), not in 4kB blocks.

Correspondence between memory blocks and cache entries: consecutive memory blocks are mapped to cache lines of consecutive sets. Hence, block 0 (addresses 0x0000 : 0x003F) is mapped to a cache line at set 0, block 1 (addresses 0x0040 : 0x007F) is mapped to a cache line at set 1, and so forth.

Request of reading/writing (no non-temporal writes/reads are used) a given address A requires finding its BI and then moving it to the assigned set. For instance, A = 0x4600A will have BI = 70. This BI is mapped to set 6 (BI % S).

The correct procedure to find the block identifier (or index) and the set index (SI) is the following:

 BI = A >> log2(LS) = 0x4600A >> 6 = 0x1180
 SI = BI & (S-1) = 0x1180 & 0x3F = 0x0000
 (when S is a power of two, BI & (S-1) = BI mod S)
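The calculation above can be sketched in a few lines of Python. This is only an illustration for the geometry in the question (LS = 64, S = 64); the function name `split_address` is made up for this example, and the tag is taken to be simply the block index bits that remain above the set index.

```python
LINE_SIZE = 64   # LS, bytes per cache line
NUM_SETS = 64    # S, number of sets

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # log2(64) = 6
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2(64) = 6


def split_address(addr):
    """Return (tag, set_index, byte_offset) for an address (hypothetical helper)."""
    offset = addr & (LINE_SIZE - 1)           # byte position inside the line
    block_index = addr >> OFFSET_BITS         # BI = A >> log2(LS)
    set_index = block_index & (NUM_SETS - 1)  # SI = BI & (S-1)
    tag = block_index >> INDEX_BITS           # remaining upper bits
    return tag, set_index, offset


tag, set_index, offset = split_address(0x4600A)
print(hex(tag), set_index, offset)  # 0x46 0 10
```

Note that the value 70 (0x46) computed in the question shows up here as the tag, not as the set selector.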

In order to properly (without misalignment) r/w an object (O) to cache, an alignment of LS is required.

That is not necessary. O does not need to be block-aligned; an unaligned O may simply occupy one more cache line than an aligned O of the same size.
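As a rough illustration of what alignment changes, the sketch below counts how many 64-byte lines an object touches. The helper name `lines_touched` and the example addresses are invented for this example.

```python
LINE_SIZE = 64  # LS, bytes per cache line


def lines_touched(start, size):
    """Number of cache lines covered by `size` bytes starting at `start`."""
    first_line = start // LINE_SIZE
    last_line = (start + size - 1) // LINE_SIZE
    return last_line - first_line + 1


# A 128-byte object that starts on a line boundary fits in two lines.
print(lines_touched(0x1000, 128))  # 2
# The same object shifted by 0x30 bytes straddles three lines.
print(lines_touched(0x1030, 128))  # 3
```

Both placements work; the unaligned one just needs one extra line to be present in the cache.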

Q1. Will the O be serially aligned in the cache or can it take (for instance) free slots 0 & 4 & 5, instead of 0 & 1 & 2?

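The blocks of O will be stored in consecutive sets at cache-line granularity (set k, k+1, …, S-1, 0, 1, …). Within each set, the block can occupy any of the A lines; which one is decided by the replacement policy.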
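To make the wrap-around concrete, here is a small sketch that lists the set indices an object occupies, assuming the 64-set, 64-byte-line geometry from the question. `sets_for_object` and the start address are hypothetical, chosen so that the object crosses the end of the set range.

```python
LINE_SIZE = 64  # LS
NUM_SETS = 64   # S


def sets_for_object(start, size):
    """Set index of every cache line covered by the object, in address order."""
    first_block = start // LINE_SIZE
    last_block = (start + size - 1) // LINE_SIZE
    return [block % NUM_SETS for block in range(first_block, last_block + 1)]


# A 256-byte object starting at 0x0F80 covers blocks 62..65.
print(sets_for_object(0x0F80, 256))  # [62, 63, 0, 1]
```

Each entry is one cache line in a different set, so consecutive lines of O never compete with each other for the same set until O exceeds S * LS = 4kB.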

Q2. What is the cost (penalty) of retrieving partitioned O from cache? Assume that the O isn't partitioned across several B.

Q3. The same question as above, but in case when O is placed in two B, thus two sets are used.

I assume you are interested in the cost of the CPU reading the words of O from cache. Supposing O is referenced sequentially, the number of cache accesses will be equal to the number of referenced words. I think the cost does not depend on the blocks being in the same or in different sets (at least in a multiported cache).

Q4. What will happen if the O size is larger than the SS (512B)? Will it move the data to L2 and step-by-step move data to L1? Will it use other sets?

Q5. What if L2 (and L3 for that matter) is too small for all the data?

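O simply uses other sets: since consecutive blocks map to consecutive sets, an O larger than 512B is spread over several sets, one cache line per set, and only wraps around to the first set again after 4kB (S * LS). If a block has to be allocated to a set with no free cache lines, a resident block has to be selected for eviction (the victim block). The replacement policy selects the victim block according to an algorithm (LRU, pLRU, random). The same mechanism applies at L2 and L3 when they are too small for all the data.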
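The eviction step can be modelled with a toy 8-way set. This sketch assumes a true LRU policy, which is a simplification (real L1 caches often use a pseudo-LRU approximation), and the class name `CacheSet` is invented for illustration.

```python
from collections import OrderedDict

WAYS = 8  # associativity A


class CacheSet:
    """Toy model of one cache set with true LRU replacement (an assumption)."""

    def __init__(self, ways=WAYS):
        self.ways = ways
        self.lines = OrderedDict()  # tags, least recently used first

    def access(self, tag):
        """Touch a tag; return the evicted tag, or None if nothing was evicted."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark as most recently used
            return None
        victim = None
        if len(self.lines) >= self.ways:
            victim, _ = self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = None
        return victim


s = CacheSet()
evicted = [s.access(tag) for tag in range(9)]
print(evicted[-1])  # 0
```

The first eight blocks fill the set; the ninth block that maps to the same set (in the question's geometry, an address 4kB further on each time) pushes out the least recently used one.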

Tags:

optimization

cpu-cache

Red XIII

1 Answers

chus
