
What causes a L3 cache miss in CPU?

I have a question about the relation between cache misses at different cache levels on an x86 architecture (say, a Xeon X5660).

I profiled an OpenCL application (Blackscholes) using several performance counters. For each counter, I summed the values over all cores and got these results:

    instructions #:  493167746502.000000
    L3_MISS #:       1967809.000000
    L1_MISS #:       2344383795.000000
    L2_DATA_MISS #:  901131.000000
    L2_MISS #:       1397931.000000
    memory loads #:  151559373227.000000

The question is: why is the number of L3 misses larger than the number of L2 misses? (I have rerun the profiling many times and the variance is not significant.) My basic assumption was:

L2 misses = L3 hits + L3 misses
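That expectation can be turned into a quick consistency check. This is a minimal sketch, assuming the counter totals are exactly the summed values listed above and that every counted L2 miss goes on to look up L3; the helper name `implied_l3_hits` is hypothetical.

```python
# Summed counter values from the question (all cores combined).
counters = {
    "L2_MISS": 1397931,
    "L3_MISS": 1967809,
}


def implied_l3_hits(l2_misses: int, l3_misses: int) -> int:
    """Under the assumed identity L2 misses = L3 hits + L3 misses,
    the L3 hits are whatever remains after subtracting the L3 misses."""
    return l2_misses - l3_misses


hits = implied_l3_hits(counters["L2_MISS"], counters["L3_MISS"])
print(f"implied L3 hits: {hits}")  # prints "implied L3 hits: -569878"
if hits < 0:
    # A negative hit count is impossible, so the two counters cannot be
    # measuring the same stream of requests over the same scope.
    print("inconsistent: L3_MISS exceeds L2_MISS")
```

The negative result is exactly the contradiction the question describes: under the assumed identity, the two totals cannot both be correct for the same set of requests.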

Could someone explain what is going wrong here? Did I miss something?

Taking it a bit further: what causes a read from the CPU's last-level cache? Is it simply a data miss in L2?

Thanks

asked May 02 '12 13:05 by Zk1001




1 Answer

[Image: The 32-nanometer, six-core Westmere-EP chip. Source: http://www.theregister.co.uk/2010/02/03/intel_westmere_ep_preview/]

As the image shows, in the Westmere-EP architecture each block of three cores shares a section of the L3 cache, so what "boiler96" says makes sense. Either your L2 miss counts are per core, or your L3 miss count comes from the Uncore, where it is the combined count of misses from all cores.
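One way such a scope mismatch could inflate the L3 total: if the L3 miss counter is a socket-wide Uncore counter, every core that reads it sees the same socket total, so summing it "over all cores" (as in the question) counts each miss once per core. The sketch below is purely illustrative. The per-core numbers, the 40% L3 miss fraction and the single six-core socket are made-up assumptions, not a model of the real PMU.

```python
from typing import List


def summed_over_cores(per_core_values: List[int]) -> int:
    """What the question's method does: add up one reading per core."""
    return sum(per_core_values)


# Hypothetical per-core L2 misses on one six-core socket (illustrative only).
l2_miss_per_core = [200, 180, 220, 210, 190, 200]

# Assumption for illustration: 40% of those L2 misses also miss in L3.
true_l3_misses = sum(l2_miss_per_core) * 40 // 100

# A socket-wide (Uncore) counter reports the same total to every core.
l3_miss_reported_per_core = [true_l3_misses] * len(l2_miss_per_core)

l2_total = summed_over_cores(l2_miss_per_core)
l3_total = summed_over_cores(l3_miss_reported_per_core)
print(l2_total, true_l3_misses, l3_total)  # prints "1200 480 2880"

# Counting the shared counter once per socket removes the inflation.
l3_once_per_socket = l3_total // len(l2_miss_per_core)
assert l3_once_per_socket <= l2_total
```

In this toy case the naive sum makes L3 misses (2880) exceed L2 misses (1200). Reading the shared counter once per socket brings it back to 480.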

answered Sep 18 '22 06:09 by dvishal