I am doing some cache performance measuring and I need to ensure the caches are empty of "useful" data before timing.
Assuming an L3 cache of 10MB, would it suffice to create a vector of 10M/4 = 2,500,000 floats, iterate through the whole of this vector, and sum the numbers? Would that empty the whole cache of any data which was in it prior to iterating through the vector?
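For reference, a minimal sketch of the flushing loop the question describes (the 10MB figure and the float element type come from the question; the real L3 size would need to be checked for the target CPU):

```cpp
#include <cstddef>
#include <vector>

// Touch every element of a buffer roughly the size of the L3 cache,
// so that data cached before the call is evicted. The 10 MB figure is
// the assumed L3 size from the question; adjust for your CPU.
float flush_l3()
{
    constexpr std::size_t kL3Bytes = 10u * 1000 * 1000;
    std::vector<float> scrub(kL3Bytes / sizeof(float), 1.0f);

    float sum = 0.0f;
    for (float x : scrub)
        sum += x;

    // Return (or otherwise use) the sum so the compiler cannot
    // optimise the whole loop away.
    return sum;
}
```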
You should be aware that the L1 and sometimes the L2 caches are per core, so even after clearing the L3 cache you could still run into trouble if your program switches cores.
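One way to sidestep that core-switching problem is to pin the measuring thread to a single core for the duration of the test. A minimal Linux-specific sketch (the core number is an arbitrary choice, and `pthread_setaffinity_np` is a GNU extension):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core so the same L1/L2 caches are used
// throughout the measurement.
bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Calling something like `pin_to_core(0)` before both the flushing loop and the timed section keeps them on the same L1/L2.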
The L2 cache contains data that is likely to be accessed by the CPU during the next stretch of execution. In most modern CPUs, the L1 and L2 caches are located on the CPU die itself.
Imagine that a CPU has to load data from the L1 cache 100 times in a row. If the L1 cache has a 1 ns access latency and a 100 percent hit rate, it takes the CPU 100 ns to perform this operation.
You'll notice that CPU caches are always described with the labels L1, L2, L3, and sometimes even L4. These denote the levels of the multi-level cache used by CPUs: L1 is level 1, L2 is level 2, and L3, of course, level 3. L1 is the fastest cache memory found in any consumer PC.
Yes, that should be sufficient for flushing the L3 cache of useful data.
I have done similar types of measurements and cross-checked them by using Intel's cache counters to verify that I incur the expected number of L3 cache misses during my tests.
If you want to be absolutely sure, you should also use the counters. In particular, you can measure last-level cache misses by using Event select 2EH, Umask 41H on most Intel architectures.
See the Intel Manual for details on these counters.
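On Linux, one way to read that counter from inside the benchmark is the perf_event_open interface. The sketch below passes the event as the raw code 0x412E (umask 41H in the high byte, event select 2EH in the low byte), which is the usual raw encoding for perf; the event numbers should still be checked against the manual for your particular CPU:

```cpp
#include <cstdint>
#include <cstdio>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open a counter for the last-level-cache miss event
// (Event select 2EH, Umask 41H), passed to perf as raw code 0x412E.
static int open_llc_miss_counter()
{
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x412E;        // (umask 41H << 8) | event select 2EH
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // Count for the calling thread, on whatever CPU it runs on.
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main()
{
    int fd = open_llc_miss_counter();
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... run the cache-scrubbing loop and/or the timed code here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == (ssize_t)sizeof(misses))
        std::printf("L3 cache misses: %llu\n",
                    static_cast<unsigned long long>(misses));
    close(fd);
    return 0;
}
```

Alternatively, `perf stat -e LLC-load-misses ./your_benchmark` gives a quick external check without modifying the program.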
It depends on how insane you are trying to be to get your guarantee.
x86_64 L3 caches are physically indexed, and while a 10MiB chunk that is linear in virtual address space is almost certainly going to be physically contiguous on a lightly loaded machine, it's not guaranteed.
Sandy Bridge and Ivy Bridge, for example, have their L3 cache in 2MiB slices with 16-way set associativity (128KiB stride), so you could guarantee physical coverage by doing a MAP_HUGETLB mmap() call, assuming standard 2-4MiB huge pages.
Also, since each slice (on Sandy/Ivy Bridge at least) is attached to a different core, and the slice a given physical address resides on is determined by a hash of some low/middle-order address bits, you might have to make the array slightly larger than the size of the L3 to compensate for slightly uneven coverage of the slices.
At this point, scrubbing your array a few times linearly should do the trick.
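For what it's worth, a rough sketch of that huge-page approach on Linux is below. It assumes 2MiB huge pages have been reserved (e.g. via /proc/sys/vm/nr_hugepages); the 12MiB buffer size (a little more than the 10MiB L3) and the three scrub passes are arbitrary illustrative choices, not values from the answer:

```cpp
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

int main()
{
    // Slightly more than the assumed 10 MiB L3, rounded to whole 2 MiB
    // huge pages, to cover uneven hashing of addresses across the slices.
    constexpr std::size_t kBytes = 12u * 1024 * 1024;

    void* buf = mmap(nullptr, kBytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    volatile char* p = static_cast<volatile char*>(buf);

    // Scrub the buffer a few times, touching every cache line
    // (64-byte stride), so every L3 set gets overwritten.
    for (int pass = 0; pass < 3; ++pass)
        for (std::size_t i = 0; i < kBytes; i += 64)
            p[i] = static_cast<char>(pass);

    munmap(buf, kBytes);
    return 0;
}
```

Summing the buffer with reads, as in the question, works just as well; the point is simply to touch every cache line of a buffer at least as large as the L3.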