
How does one write code that best utilizes the CPU cache to improve performance?


The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency) and, as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).

Techniques for avoiding memory fetch latency are typically the first thing to consider, and sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.

Improving spatial locality means ensuring that each cache line is used in full once it has been mapped into the cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of them fail to use 100% of the fetched cache lines before the lines are evicted.

Improving cache line utilization helps in three respects:

  • It tends to fit more useful data in the cache, essentially increasing the effective cache size.
  • It tends to fit more useful data in the same cache line, increasing the likelihood that requested data can be found in the cache.
  • It reduces the memory bandwidth requirements, as there will be fewer fetches.

Common techniques are:

  • Use smaller data types
  • Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way; see the sketch after this list)
  • Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
  • Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
  • Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.
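
As a hedged sketch of two of these techniques (member reordering and hot/cold splitting), with made-up structure names and field sizes assuming a typical 64-bit ABI:

// Members sorted by decreasing size: no padding holes, so more elements
// fit into each cache line. (Ordered char/double/int instead, the same
// fields would need 24 bytes rather than 16 on a typical 64-bit ABI.)
struct Particle {
    double position;   // 8 bytes
    int    id;         // 4 bytes
    char   flags;      // 1 byte + 3 bytes of tail padding
};

// Hot/cold split: the fields touched by the hot loop stay densely packed,
// while the bulky, rarely used data lives in a separate parallel array.
struct ParticleHot  { double position; double velocity; };
struct ParticleCold { char   name[64]; int    debug_info; };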

We should also note that there are other ways to hide memory latency than using caches.

Modern CPUs often have one or more hardware prefetchers. They train on misses in a cache and try to spot regularities. For instance, after a few misses to consecutive cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
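
As a sketch, GCC and Clang expose a __builtin_prefetch intrinsic for exactly this; the index-driven gather below is a hypothetical pattern the hardware prefetcher cannot predict, and PREFETCH_DISTANCE is an illustrative tuning knob, not a fixed rule:

#include <cstddef>

constexpr std::size_t PREFETCH_DISTANCE = 16;  // illustrative; tune per workload

double gather_sum(const double* data, const int* idx, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]]);  // hint: will be read soon
        sum += data[idx[i]];
    }
    return sum;
}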

By regrouping instructions so that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustains one latency hit (memory-level parallelism).
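
A hedged sketch of that idea: walking two independent linked lists in an interleaved fashion gives the CPU two independent misses it can overlap, instead of paying the two latencies back to back (the names are illustrative, not from the answer above):

struct Node { int value; Node* next; };

int sum_two_lists(const Node* a, const Node* b) {
    int sum = 0;
    while (a && b) {                          // the two loads can miss in parallel
        sum += a->value + b->value;
        a = a->next;
        b = b->next;
    }
    for (; a; a = a->next) sum += a->value;   // drain whichever list is longer
    for (; b; b = b->next) sum += b->value;
    return sum;
}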

To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.

Merging loops that touch the same data (loop fusion) and employing rewriting techniques known as tiling or blocking both strive to avoid those extra memory fetches.
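
As a minimal sketch of blocking, assuming a square row-major matrix and purely illustrative sizes (N divisible by the block size B, with two B x B tiles chosen to fit in cache):

constexpr int N = 4096;
constexpr int B = 64;

void transpose_blocked(const float* src, float* dst) {
    // Both src and dst are touched one B x B tile at a time, so the lines
    // fetched for the strided side are reused before they are evicted.
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; ++i)
                for (int j = jj; j < jj + B; ++j)
                    dst[j * N + i] = src[i * N + j];
}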

While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
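
For example (a hypothetical sketch), these two loops must not be naively fused, because the second reads b[i + 1], which the first loop is still in the process of writing:

void separate_loops(float* a, float* b, int n) {
    for (int i = 0; i < n; ++i)
        b[i] = a[i] * 2.0f;           // first pass finishes writing all of b
    for (int i = 0; i + 1 < n; ++i)
        a[i] = b[i] + b[i + 1];       // fused, b[i + 1] would still hold the old value
}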

These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.


I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":

// cache-unfriendly: the inner loop varies the row index
for (int i = 0; i < size; i++)
  for (int j = 0; j < size; j++)
    do_something_with(ary[j][i]);

The reason this is cache inefficient is that when you access a single memory address, modern CPUs load a whole cache line of "near" addresses from main memory. Here the inner loop varies "j", which indexes the rows (the outer array dimension), so every trip through the inner loop jumps to a different row: a fresh cache line of addresses near the [j][i] entry has to be loaded for essentially every access, and by the time the loop comes back to a row, its line has typically been evicted. If this is changed to the equivalent:

// cache-friendly: the inner loop walks along a row
for (int i = 0; i < size; i++)
  for (int j = 0; j < size; j++)
    do_something_with(ary[i][j]);

It will run much faster.
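
For concreteness, here is a minimal self-contained version of the two traversals, with the array stored as one contiguous block (the function names are just for illustration):

#include <cstddef>
#include <vector>

double sum_row_major(const std::vector<double>& a, std::size_t size) {
    double sum = 0.0;
    for (std::size_t i = 0; i < size; ++i)
        for (std::size_t j = 0; j < size; ++j)
            sum += a[i * size + j];   // sequential: uses each cache line in full
    return sum;
}

double sum_col_major(const std::vector<double>& a, std::size_t size) {
    double sum = 0.0;
    for (std::size_t i = 0; i < size; ++i)
        for (std::size_t j = 0; j < size; ++j)
            sum += a[j * size + i];   // strided: touches a new cache line almost every step
    return sum;
}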


The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.

The cache works on two principles: Temporal locality and spatial locality. The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.

The cache tries to accommodate this by remembering the most recently used chunks of data. It operates on cache lines, typically 64 or 128 bytes, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.

And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.

A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.

The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.

Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.

In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss when you dereference them.
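
As a hedged illustration of that layout difference, summing a contiguous std::vector versus a node-based std::list (same algorithm, very different access patterns):

#include <list>
#include <numeric>
#include <vector>

long sum_vector(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);  // contiguous: sequential cache-line reads
}

long sum_list(const std::list<int>& l) {
    return std::accumulate(l.begin(), l.end(), 0L);  // pointer chasing: likely a miss per node
}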

And if you want to exploit multiple cores, it can get really interesting, because usually only one core may hold a given cache line in a writable state in its L1 cache at a time. So if both cores constantly write to data on the same line, they keep invalidating each other's copy, and you get constant cache misses as they fight over the line.
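
A hedged sketch of that effect, assuming 64-byte cache lines: two threads hammering counters that share a line keep stealing the line from each other, while aligning each counter to its own line avoids the ping-pong.

#include <atomic>
#include <thread>

struct SharedLine {                      // both counters land in one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedLines {                     // one cache line per counter (64 bytes assumed)
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}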


I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.

Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).