I have a question about the relationship between cache misses at different cache levels on an x86 architecture (say, a Xeon X5660).
I profiled an OpenCL application (Blackscholes) using several performance counters. For each counter, I summed the values over all cores and got this result:
instructions #: 493167746502.000000
L3_MISS #: 1967809.000000
L1_MISS #: 2344383795.000000
L2_DATA_MISS #: 901131.000000
L2_MISS #: 1397931.000000
memory loads #: 151559373227.000000
The question is: why is the number of L3 misses bigger than the number of L2 misses? (I reran the profiling many times and the variance is not significant.) What I expected is basically:
L2 misses = L3 hits + L3 misses
Could someone explain to me what goes wrong here? Did I miss something?
Taking it a bit further, what causes a read of the CPU's last-level cache? Is it simply a data miss from L2?
Thanks
A cache miss occurs either because the data was never placed in the cache, or because the data was removed (“evicted”) from the cache by either the caching system itself or an external application that specifically made that eviction request.
Level 3 (L3) cache is specialized memory developed to improve the performance of L1 and L2. L1 or L2 can be significantly faster than L3, though L3 is usually double the speed of DRAM. With multicore processors, each core can have dedicated L1 and L2 cache, but they can share an L3 cache.
In the case of a cache hit, the processor immediately reads or writes the data in the cache line. For a cache miss, the cache allocates a new entry and copies data from main memory, then the request is fulfilled from the contents of the cache.
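To make the hit/miss flow concrete, here is a minimal sketch of a toy two-level hierarchy in Python. Everything in it is an assumption for illustration: the `LRUCache` class, the `simulate` helper, the fully associative LRU policy and the tiny capacities are hypothetical and do not model the real Xeon X5660 caches. It only shows that when every L3 access is caused by an L2 miss, the identity from the question (L2 misses = L3 hits + L3 misses) holds by construction.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (capacity in cache lines)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # line number -> None, ordered oldest to newest
        self.hits = 0
        self.misses = 0

    def access(self, line):
        """Return True on a hit; on a miss, allocate the line, evicting the LRU one if full."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = None
        return False


def simulate(addresses, l2_lines=4, l3_lines=16, line_size=64):
    """Run a byte-address trace through a hypothetical L2 backed by an L3."""
    l2, l3 = LRUCache(l2_lines), LRUCache(l3_lines)
    for addr in addresses:
        line = addr // line_size
        if not l2.access(line):
            l3.access(line)  # only L2 misses are forwarded to L3
    return l2, l3


if __name__ == "__main__":
    trace = [0, 64, 128, 0, 4096, 64, 0]
    l2, l3 = simulate(trace, l2_lines=2)
    print("L2 misses:", l2.misses, "L3 hits:", l3.hits, "L3 misses:", l3.misses)
    assert l2.misses == l3.hits + l3.misses
```

In this model the identity can never be violated, so measured counters that break it suggest that the hardware counters are not counting the same set of requests.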
The worst cache miss rate occurs when there is no tiling, but the worst CPI occurs with tile size 288 × 288. CPI improves slightly when tiling is discontinued. This is likely due to lower instruction CPI that results from the reduction of executed branch instructions from needing fewer iterations of the tile loops.
Image Ref : http://www.theregister.co.uk/2010/02/03/intel_westmere_ep_preview/
As you can see above, in the 'Westmere-EP' architecture a block of 3 cores shares a section of L3 cache. So what "boiler96" says makes sense: either you are getting L2 misses for an individual core, or your L3 miss count is coming from the uncore, in which case it is the combined miss count from all cores.
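As a quick sanity check, the counter values posted in the question can be plugged into the expected identity. The sketch below is only arithmetic on those numbers; the helper names `implied_l3_hits` and `mpki` are made up for illustration and are not part of any profiling tool.

```python
# Counter totals copied from the question (summed over all cores by the asker).
counters = {
    "instructions": 493167746502,
    "L1_MISS": 2344383795,
    "L2_MISS": 1397931,
    "L2_DATA_MISS": 901131,
    "L3_MISS": 1967809,
    "memory_loads": 151559373227,
}


def implied_l3_hits(c):
    """L3 hits implied by L2_MISS = L3 hits + L3_MISS; a negative value means the identity cannot hold."""
    return c["L2_MISS"] - c["L3_MISS"]


def mpki(misses, instructions):
    """Misses per thousand instructions."""
    return 1000.0 * misses / instructions


if __name__ == "__main__":
    print("implied L3 hits:", implied_l3_hits(counters))
    for name in ("L1_MISS", "L2_MISS", "L3_MISS"):
        print(name, "MPKI:", round(mpki(counters[name], counters["instructions"]), 6))
```

The implied L3 hit count comes out negative, so the two counters cannot be describing the same stream of requests. That fits the explanation above that the L2 and L3 events are counted with different scope.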