 

decreasing cache misses through good design


How can I decrease the number of cache misses when designing a C++ program?

Does inlining functions help every time, or is it only beneficial when the program is CPU-bound (i.e. computation-oriented rather than I/O-oriented)?

Josef asked Jan 20 '09 at 10:01




2 Answers

Here are some things that I like to consider when working on this kind of code.

  • Consider whether you want "structures of arrays" or "arrays of structures". Which layout wins depends on how each part of the data is accessed (see the first sketch after this list).
  • Try to keep structures to multiples of 32 bytes so they pack cache lines evenly.
  • Partition your data into hot and cold elements. If you have an array of objects of class o, and you use o.x, o.y, o.z together frequently but only occasionally need to access o.i, o.j, o.k, then consider putting o.x, o.y, and o.z together and moving the i, j, and k parts into a parallel auxiliary data structure.
  • If you have multidimensional arrays of data, then with the usual row-order layouts access will be very fast when scanning along the preferred dimension and very slow along the others. Mapping the data along a space-filling curve instead helps balance access speeds when traversing in any dimension; see the Z-order sketch below. (Blocking techniques are similar -- they're just Z-order with a larger radix.)
  • If you must incur a cache miss, then try to do as much as possible with that data in order to amortize the cost.
  • Are you doing anything multi-threaded? Watch out for slowdowns from cache coherency protocols. Pad flags and small counters so that they sit on separate cache lines and aren't falsely shared; a padding sketch follows this list.
  • SSE on Intel provides some prefetch intrinsics (e.g. _mm_prefetch) if you know what you'll be accessing far enough ahead of time; a prefetch sketch also follows below.
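
To make the first and third bullets concrete, here is a minimal sketch contrasting an array-of-structures layout with a structure-of-arrays layout that also splits hot and cold fields. The ParticleAoS/ParticlesSoA names and the field choices are illustrative, not from the original answer.

    #include <cstddef>
    #include <vector>

    // Array of structures: every element carries hot (x, y, z) and cold
    // (i, j, k) fields together, so a scan that only needs x/y/z still
    // drags the cold fields through the cache.
    struct ParticleAoS {
        float x, y, z;   // used together frequently (hot)
        int   i, j, k;   // used only occasionally (cold)
    };

    // Structure of arrays with a hot/cold split: the hot fields are packed
    // contiguously and the cold fields live in a parallel auxiliary array.
    struct ParticlesSoA {
        std::vector<float> x, y, z;            // hot data, scanned linearly
        struct Cold { int i, j, k; };
        std::vector<Cold>  cold;               // cold data, touched rarely
    };

    // Scanning only the hot fields now walks tightly packed cache lines.
    float sumX(const ParticlesSoA& p) {
        float s = 0.0f;
        for (std::size_t n = 0; n < p.x.size(); ++n)
            s += p.x[n];
        return s;
    }

Roughly, AoS tends to win when a loop touches most fields of each element, while SoA tends to win when it touches only a few of them.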
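
For the space-filling-curve bullet, the classic bit-interleaving trick below computes a 2-D Z-order (Morton) index; the function names are invented for the example.

    #include <cstdint>

    // Spread the low 16 bits of v so a zero bit sits between each pair of
    // original bits: ...b3 b2 b1 b0 -> ...0 b3 0 b2 0 b1 0 b0.
    static std::uint32_t part1by1(std::uint32_t v) {
        v &= 0x0000ffffu;
        v = (v | (v << 8)) & 0x00ff00ffu;
        v = (v | (v << 4)) & 0x0f0f0f0fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    }

    // Interleave x and y into a Z-order index: elements that are close in
    // 2-D tend to be close in memory, so walks along either dimension hit
    // nearby cache lines instead of striding across the whole array.
    std::uint32_t mortonIndex(std::uint32_t x, std::uint32_t y) {
        return part1by1(x) | (part1by1(y) << 1);
    }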
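
For the multi-threading bullet, a minimal sketch of keeping per-thread counters on separate cache lines, assuming a 64-byte line size and C++11 alignas; the PerThreadCounter name is made up for illustration.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kCacheLineSize = 64;  // assumed line size

    // Without the alignment, counters for different threads could share a
    // cache line, and the coherency protocol would bounce that line between
    // cores on every increment (false sharing).
    struct alignas(kCacheLineSize) PerThreadCounter {
        std::atomic<std::uint64_t> value{0};
        // alignas pads each instance out to a full cache line.
    };

    PerThreadCounter counters[8];  // one per worker thread, each on its own line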
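
And for the prefetch bullet, a sketch using the _mm_prefetch intrinsic from <xmmintrin.h>; the prefetch distance of 16 elements is a placeholder you would tune for your own data and hardware.

    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
    #include <cstddef>

    float sum(const float* data, std::size_t n) {
        const std::size_t kPrefetchAhead = 16;  // illustrative distance, tune it
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kPrefetchAhead < n)
                _mm_prefetch(reinterpret_cast<const char*>(data + i + kPrefetchAhead),
                             _MM_HINT_T0);  // request the line before we need it
            s += data[i];
        }
        return s;
    }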
Boojum answered Sep 27 '22 at 21:09


For data-bound operations:

  1. prefer arrays and vectors over lists, maps, and sets

  2. process data by rows rather than by columns (see the sketch below)
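
A minimal sketch of the row-by-row point, assuming a row-major matrix stored in a flat std::vector (the names are illustrative):

    #include <cstddef>
    #include <vector>

    // Row-major storage: element (r, c) lives at r * cols + c, so consecutive
    // columns of the same row are adjacent in memory.
    double sumRowMajor(const std::vector<double>& m,
                       std::size_t rows, std::size_t cols) {
        double s = 0.0;
        for (std::size_t r = 0; r < rows; ++r)       // outer loop over rows
            for (std::size_t c = 0; c < cols; ++c)   // inner loop walks contiguously
                s += m[r * cols + c];
        return s;
    }

    // Swapping the loops (columns outside, rows inside) strides by `cols`
    // elements on every access and misses the cache far more often.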

Chris answered Sep 27 '22 at 23:09