I have an Intel Ivy Bridge processor, an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (L1: 32KB, L2: 256KB, L3: 8MB). I know the L3 is inclusive and shared among all cores. I want to know the following with respect to my system:
PART1:
PART2:
If L1 and L2 are both inclusive, then to find the access time of L2 we first declare an array (1MB) larger than the L2 cache (256KB), then access the whole array once so it is loaded into the L2 cache. After that, we access the array elements from the start index to the end index with a stride of 64B, since the cache line size is 64B. To get a more accurate result, we repeat this process (accessing the array elements from start to end) many times, say 1 million times, and take the average.
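Roughly, the loop I have in mind looks like the sketch below (a minimal version in C; the names ARRAY_SIZE, STRIDE, and REPEAT are my own, and the timing uses the __rdtsc() intrinsic):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define ARRAY_SIZE (1024 * 1024)  /* 1MB: larger than the 256KB L2 */
    #define STRIDE     64             /* one access per 64B cache line */
    #define REPEAT     1000000        /* repeat and average */

    int main(void)
    {
        volatile char *array = malloc(ARRAY_SIZE);
        uint64_t total = 0;

        /* Warm-up pass: touch every cache line once so the array is
           pulled into the cache hierarchy. */
        for (size_t i = 0; i < ARRAY_SIZE; i += STRIDE)
            array[i] = 1;

        for (long r = 0; r < REPEAT; r++) {
            uint64_t start = __rdtsc();
            for (size_t i = 0; i < ARRAY_SIZE; i += STRIDE)
                (void)array[i];       /* one read per cache line */
            total += __rdtsc() - start;
        }

        printf("avg cycles per access: %.2f\n",
               (double)total / REPEAT / (ARRAY_SIZE / STRIDE));
        return 0;
    }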
My understanding of why this approach gives the correct result is as follows: when we access an array larger than the L2 cache, the whole array is loaded from main memory into L3, then from L3 into L2, then from L2 into L1. The last 32KB of the array is in L1, since it was accessed most recently. The whole array is also present in the L2 and L3 caches due to the inclusive property and cache coherency. Now, when I start accessing the array again from the starting index, the data is not in the L1 cache but is in the L2 cache, so there will be an L1 miss and the line will be served from L2. In this way every element of the array incurs the higher access time, and in total I get the total access time of the whole array. To get the time of a single access, I take the average over the total number of accesses.
My question is: am I correct?
Thanks in advance.
An advantage of inclusive caches is that whatever has been brought into the cache hierarchy by one core is available to the other cores. AMD processors tend to have exclusive caches; Intel processors tend to have inclusive caches.
This is an inclusive cache model, where the same data can be present in both the L1 and L2 caches. In an exclusive cache, data can be present in only one cache and an address cannot be found in both the L1 and L2 caches at the same time.
Modern CPUs also often have a very small "L0" cache, often just a few KB in size, used for storing micro-ops. AMD and Intel both use this kind of cache; Zen had a 2,048-entry µOP cache, while Zen 2 has a 4,096-entry µOP cache.
L1 is "level-1" cache memory, usually built onto the microprocessor chip itself. For example, the Intel MMX microprocessor comes with 32 thousand bytes of L1. L2 (that is, level-2) cache memory is on a separate chip (possibly on an expansion card) that can be accessed more quickly than the larger "main" memory.
See Section 2.2.5 in the Intel optimization guide:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
(Note that this section describes Sandy Bridge, but nothing appears to have changed for Ivy Bridge, which made only minor micro-architectural changes over the previous generation.)
So regarding your questions:
Also note that if your benchmark accesses a data set larger than the L2, it will probably fail to fit in the L2 (especially if you access it serially and exceed the L2 capacity by more than the size of a single way), and you would have to fetch it from the L3.
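If the goal is a load-latency number for the L2 specifically, a common alternative is a pointer chase over a buffer sized between the L1 (32KB) and the L2 (256KB), with the chain order randomized so the hardware prefetchers can't follow it. A minimal sketch, assuming a 64B line and an illustrative 128KB working set:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define LINE     64                /* cache line size */
    #define WSET     (128 * 1024)      /* 128KB: bigger than L1, smaller than L2 */
    #define NODES    (WSET / LINE)
    #define ACCESSES 10000000UL

    int main(void)
    {
        /* One node per cache line; each node's first bytes store the
           index of the next node in the chain. */
        char *buf = aligned_alloc(LINE, WSET);
        size_t *order = malloc(NODES * sizeof *order);
        for (size_t i = 0; i < NODES; i++)
            order[i] = i;

        /* Fisher-Yates shuffle: a random visiting order defeats the
           stride-based hardware prefetchers. */
        srand(1);
        for (size_t i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        /* Link all nodes into a single cycle in the shuffled order. */
        for (size_t i = 0; i < NODES; i++)
            *(size_t *)(buf + order[i] * LINE) = order[(i + 1) % NODES];

        /* Chase the chain: each load's address depends on the previous
           load's result, so the loads serialize and the average cost
           approximates the load-to-use latency. */
        size_t cur = 0;
        uint64_t start = __rdtsc();
        for (unsigned long i = 0; i < ACCESSES; i++)
            cur = *(size_t *)(buf + cur * LINE);
        uint64_t cycles = __rdtsc() - start;

        printf("avg cycles per dependent load: %.2f (end node %zu)\n",
               (double)cycles / ACCESSES, cur);
        free(order);
        free(buf);
        return 0;
    }

Shrinking WSET below 32KB should show the L1 latency, and growing it well past 256KB should show the step up to L3, which is one way to sanity-check the measurements against the figures in the optimization guide.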