 

How to write or read memory without touching cache

Is there any way to write/read memory without touching the L1/L2/L3 caches on x86 CPUs?

And is the cache in x86 CPUs managed entirely by hardware?

EDIT: I want to do this because I want to measure the speed of memory and see whether the performance of any region of memory degrades.

Michael Tong asked Feb 23 '15 22:02

2 Answers

The CPU indeed manages its own caches in hardware, but x86 gives you some ways to influence that management.

To access memory without caching, you could:

  1. Use the x86 non-temporal instructions. They tell the CPU that you won't be reusing this data, so there's no point in retaining it in the cache. These instructions are usually called movnt* (with a suffix according to the data type, e.g. movnti for storing a normal integer from a general-purpose register to memory). There are also streaming load/store instructions that use a similar technique but are better suited to high-bandwidth streams (where you load full lines consecutively). To use these, either code them in inline assembly or use the intrinsics provided by your compiler; most of them are named _mm_stream_*.

  2. Change the memory type of the specific region to uncacheable. Since you stated you don't want to disable all caching (and rightfully so, since that would also affect code, stack, page tables, etc.), you could mark the specific region your benchmark's data set resides in as uncacheable, using the MTRRs (memory type range registers). There are several ways of doing that; you'll need to read the documentation.

  3. The last option is to fetch the line normally, which means it does get cached initially, and then force it out of all cache levels using the dedicated clflush instruction (or the full wbinvd if you want to flush the entire cache). Make sure to fence these operations properly so that you can guarantee they're done (and, of course, don't measure them as part of the latency).

Having said that, if you want to do all this just to time your memory reads, you may get bad results, since most CPUs handle non-temporal or uncacheable accesses "inefficiently". If you're just after forcing reads to come from memory, this is best achieved by manipulating the cache's LRU state: sequentially access a data set large enough not to fit in any cache. Most LRU-like schemes (not all!) drop the oldest lines first, so the next time you wrap around, they'll have to come from memory.

Note that for this to work, you need to make sure your HW prefetcher does not help (and accidentally hide the latency you want to measure): either disable it, or make the accesses stride far enough apart for it to be ineffective.

Leeor answered Sep 28 '22 00:09


Leeor pretty much listed the most "pro" solutions for your task. I'll try to add to that with another proposal that can achieve the same results and can be written in plain C with simple code. The idea is to write a kernel similar to the "Global Random Access" benchmark found in the HPC Challenge (HPCC) benchmark suite.

The idea of the kernel is to jump randomly through a huge array of 8-byte values, generally sized at half your physical memory (so if you have 16 GB of RAM, you'd use an 8 GB array, i.e. 1G elements of 8 bytes each). For each jump you can read, write, or read-modify-write the target location.

This most likely measures RAM latency, because jumping randomly through RAM makes caching very inefficient. You will get extremely low cache hit rates, and if you perform enough operations on the array, you will be able to measure the actual performance of memory. This method also makes prefetching very ineffective, as there is no detectable pattern.

You need to take the following things into consideration:

  1. Make sure that the compiler does not optimize away your kernel loop (do something with the array, or with the values you read from it).
  2. Use a very simple random number generator and do not store the target addresses in another array (that array would be cached). I used a linear congruential generator. This way the next address is computed very quickly and adds no latency beyond that of the RAM access itself.
VAndrei answered Sep 28 '22 00:09