Understanding how the CPU decides what gets loaded into cache memory

Let's say a computer has 64 KB of L1 cache and 512 KB of L2 cache.

The programmer has created and populated an array of, say, 10 MB of data in main memory (e.g. the vertex/index data of a 3D model).

The array might contain a series of structs like:

struct x
{
  vec3 pos;
  vec3 normal;
  vec2 texcoord;
};

Next the programmer has to perform some operation on all of this data, e.g. a one-time normal computation, before passing the data over to the GPU.

How does the CPU decide how data gets loaded into L2 cache?

How can the programmer check what size a cache line is for any given architecture?

How can the programmer ensure that data is organised so that it fits into cache lines?

Is data alignment to byte boundaries the only thing that can be done to aid this process?

What can the programmer do to minimize cache misses?

What profiling tools are available that will help visualize the optimization process on the Windows and Linux platforms?

asked Sep 02 '13 by fishfood

1 Answer

There are a lot of questions here, so I will keep the answers brief.

How does the CPU decide how data gets loaded into L2 cache?

Whatever you use gets loaded. L2 behaves the same as L1, except there is more of it, and aliasing (which may result in premature eviction) is more common because of larger lines and lower set associativity. Some CPUs only load L2 with data that is being evicted from L1, but this makes little difference to the programmer.

Most MMUs have a facility for uncached memory, but this is for device drivers. I don't recall ever seeing an option to disable L2 without disabling L1. With no caching, you get no performance.

How can the programmer check what size a cache line is for any given architecture?

By consulting the processor's manual. Some operating systems also provide a query facility such as sysctl.
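
As a rough sketch, here is one way to query the line size at runtime. The sysconf call is a glibc/Linux facility and the C++17 interference-size constant is only a compile-time hint from the compiler, so treat both as platform-dependent assumptions rather than guarantees.

// A minimal sketch of reading the L1 data cache line size at runtime.
// Assumes a Linux/glibc target for sysconf(); the C++17 constant is a
// compile-time hint, not a guaranteed hardware value.
#include <cstdio>
#include <new>        // std::hardware_destructive_interference_size (C++17)
#include <unistd.h>   // sysconf, _SC_LEVEL1_DCACHE_LINESIZE (glibc extension)

int main()
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        std::printf("L1 data cache line size: %ld bytes\n", line);
    else
        std::printf("sysconf could not report the line size on this system\n");

#ifdef __cpp_lib_hardware_interference_size
    std::printf("compiler's interference-size hint: %zu bytes\n",
                std::hardware_destructive_interference_size);
#endif
    return 0;
}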

How can the programmer ensure that data is organised so that it fits into cache lines?

The key idea is spatial locality. Data which is accessed at the same time, by the same inner loop, should go into the same data structure. The optimal organization is to fit that structure onto a cache line and align it to the cache line size.

Don't go to the trouble unless you are carefully using your profiler as a guide.
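For illustration only, here is a sketch of what fitting and aligning such a structure can look like, assuming a 64-byte line (substitute the real size queried as described above):

// Sketch only: pad and align the per-vertex struct to a 64-byte line.
// 64 is an assumption; use the line size reported by your platform.
struct alignas(64) Vertex
{
    float pos[3];       // 12 bytes
    float normal[3];    // 12 bytes
    float texcoord[2];  //  8 bytes
    // alignas(64) rounds sizeof(Vertex) up to 64, so consecutive array
    // elements start on line boundaries and never straddle two lines.
};

static_assert(sizeof(Vertex) == 64, "Vertex no longer fits in one cache line");

Note that the padding doubles the per-vertex footprint from 32 to 64 bytes, which is exactly the kind of trade-off the profiler should justify before you commit to it.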

Is data alignment to byte boundaries the only thing that can be done to aid this process?

No, the other part is avoiding filling the cache with extraneous data. If some fields are only going to be used by some other algorithm, then they are wasting cache space while the present algorithm runs. But you can't optimize everything all the time, and reorganizing the data structures takes programming effort.
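As a hypothetical sketch of that idea, the fields the normal-computation pass touches can be split from the fields it never reads, so the inner loop only streams data it actually uses (the names and layout below are illustrative, not taken from the question):

// Hypothetical hot/cold split: the normal-recomputation pass reads
// positions and writes normals, so texcoords live in a separate array
// and are never dragged into the cache by this loop.
#include <vector>

struct Vec3 { float x, y, z; };
struct Vec2 { float u, v; };

struct HotVertex  { Vec3 pos; Vec3 normal; };  // touched by this pass
struct ColdVertex { Vec2 texcoord; };          // untouched by this pass

void recompute_normals(std::vector<HotVertex>& hot)
{
    for (HotVertex& v : hot)
    {
        // Placeholder for the real normal computation; the point is that
        // every cache line fetched here contains only data the loop uses.
        v.normal = Vec3{0.0f, 0.0f, 1.0f};
    }
}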

What can the programmer do to minimize cache misses?

Profile using real-world data, and treat excessive misses as a bug.
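As a concrete illustration of the kind of bug a profiler will surface, the classic case is a traversal order that fights the memory layout; the sketch below assumes a plain row-major C array.

// C/C++ 2-D arrays are row-major, so the first loop walks memory
// sequentially and reuses each fetched cache line, while the second
// strides by a whole row and can miss on nearly every access for large N.
#include <cstddef>

constexpr std::size_t N = 1024;
static float grid[N][N];

float sum_row_major()      // cache-friendly: consecutive addresses
{
    float s = 0.0f;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += grid[i][j];
    return s;
}

float sum_column_major()   // cache-hostile: stride of N * sizeof(float)
{
    float s = 0.0f;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += grid[i][j];
    return s;
}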

What profiling tools are available that will help visualize the optimization process on the Windows and Linux platforms?

Cachegrind is very nice but uses a virtual machine. Intel VTune uses your actual hardware, for better or worse. I haven't used the latter.

answered Sep 22 '22 by Potatoswatter