 

cache locality for a binary tree

Tags: c, caching

If I have a tree like the following

typedef struct tree_t {
    // data
    struct tree_t *left;
    struct tree_t *right;
} tree_t;

and I want to start allocating memory for the leaves, is there a way to ensure that when I traverse the tree the leaves are being cached? If I was using malloc then I think the leaves would be scattered around in the heap and there would be a cache miss every time I try to access one.

asked Jul 08 '15 by toweleeele

2 Answers

Others gave the proper answer, a fixed-size pool (https://en.wikipedia.org/wiki/Memory_pool), but there are some additional caveats that merit a more thorough explanation. It is still possible, or even likely, that leaves allocated from a memory pool will have a low cache hit rate. Blocks of 8*n tree_t's aligned on 64-byte boundaries are ideal, though there is no benefit above n == 1024.
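As a minimal sketch of such a pool (the names `tree_pool`, `pool_init`, and `pool_alloc`, and the `key` payload field, are illustrative, not from any particular library): allocate one cache-line-aligned slab of nodes up front and hand them out sequentially, so nodes allocated together sit in the same or adjacent cache lines.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

typedef struct tree_t {
    int key;              /* placeholder payload */
    struct tree_t *left;
    struct tree_t *right;
} tree_t;

typedef struct {
    tree_t *slab;   /* one contiguous, cache-line-aligned block */
    size_t next;    /* index of the next unused node */
    size_t cap;
} tree_pool;

/* Allocate a pool of `cap` nodes aligned to a 64-byte cache line. */
static int pool_init(tree_pool *p, size_t cap)
{
    void *mem = NULL;
    if (posix_memalign(&mem, 64, cap * sizeof(tree_t)) != 0)
        return -1;
    p->slab = mem;
    p->next = 0;
    p->cap = cap;
    return 0;
}

/* Hand out nodes sequentially; neighbors in allocation order share lines. */
static tree_t *pool_alloc(tree_pool *p)
{
    return p->next < p->cap ? &p->slab[p->next++] : NULL;
}
```

Because the slab is contiguous and 64-byte aligned, consecutive allocations land next to each other in memory rather than being scattered across the heap as with plain malloc().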

As a side note, look at Judy arrays, which are a cache-optimized tree-ish data structure (https://en.wikipedia.org/wiki/Judy_array).

It is helpful to review (briefly) how the cache works. Caches are divided into sets of fixed-size lines. Typically the L1 line size is 64 bytes; Intel and AMD have used 64-byte cache lines for roughly 15 years, and modern ARM cores such as the Cortex-A15 also use 64-byte lines. The associativity determines how many lines belong to a set. When data is brought into the cache, some function hashes the address to a set, and the data may be stored in any of that set's lines. For example, in a 2-way set associative cache, any given address can be stored in one of two possible cache lines.

Maximizing cacheability involves reducing cache line fetches:
1. Organizing data into cache-line-sized chunks.
2. Aligning chunks on a cache line boundary.
3. Allocating chunks at addresses that map to different cache sets.
4. Storing data with temporal locality (i.e. accessed around the same time) in the same cache line.
5. Reducing the size of data, if possible, to increase density.

If you don't do (1), then fetches will bring in nearby, likely useless data, reducing the amount of space for data that you care about. If you don't do (2), then your objects are likely to straddle cache lines, requiring twice as many fetches. If you don't do (3), then some cache sets are likely to be underutilized, with a similar effect to (1). If you don't do (4), then even though you're maximizing cache utilization, most of the data fetched isn't useful when it's fetched, and the line likely gets evicted before the data becomes useful. (5) increases the number of objects that fit in the cache by packing them into less space. For example, if you can guarantee that you will have fewer than 2^32 leaves, then you could store a uint32_t index into a tree_t[] array rather than a pointer, which halves the size of each link field on 64-bit platforms.
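Point (5) might look like the following sketch, assuming a 64-bit platform and fewer than 2^32 nodes (the `NIL` sentinel, the `packed_node` name, and the array capacity are arbitrary illustrations):

```c
#include <stdint.h>

#define NIL UINT32_MAX  /* sentinel for "no child" */

/* Indices into one shared node array replace 8-byte pointers,
 * shrinking each node's two link fields from 16 bytes to 8. */
typedef struct {
    uint32_t left;   /* index into nodes[], or NIL */
    uint32_t right;  /* index into nodes[], or NIL */
} packed_node;

packed_node nodes[1024];  /* capacity chosen only for illustration */
```

With 8-byte nodes, a 64-byte cache line now holds 8 nodes' worth of links instead of 4, doubling link density.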

Note: malloc() typically returns 8-byte- or 16-byte-aligned blocks, which are unsuitable; use posix_memalign() on POSIX systems or _aligned_malloc() on MSVC.
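On a C11 compiler, aligned_alloc() is a portable alternative; a small sketch (the helper name `alloc_lines` is made up here; note that aligned_alloc requires the size to be a multiple of the alignment):

```c
#include <stdlib.h>

/* Request a block whose start coincides with a 64-byte cache-line
 * boundary. C11 aligned_alloc requires size % alignment == 0,
 * which holds here because we allocate in whole 64-byte lines. */
static void *alloc_lines(size_t n_lines)
{
    return aligned_alloc(64, n_lines * 64);
}
```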

In your case, you're traversing a tree, presumably an in-order traversal. Unless your data set fits into the cache, leaves are likely to be spread evenly throughout and ergo are unlikely to have temporal locality with nodes in the same cache line. In that case, the best you can do is to make sure that your objects do not straddle cache lines by allocating cache-line-sized and cache-line-aligned blocks.

I picked blocks of 8*n tree_t's on the conservative assumption of a 64-byte cache line and 4-byte pointers, which results in an 8-byte tree_t, and 64 / 8 = 8 tree_t/line. The upper bound of n == 1024 is due to a particular x86 CPU (which one escapes me at the moment) ignoring bit 18 of the address for the purposes of picking a set.

answered Sep 30 '22 by icecreamsword


It may be possible on a given platform to improve the cache hit rate, but of course there is little to guarantee consistent success from run to run.

But let us try some ideas:

  1. Create tree_alloc() and tree_free() that allocate/manage multiple struct tree_t in a group, say 256 on the first call, and then dole them out for the next 255 allocations. This will complicate random allocate/free calls, but if the tree is large and its growth/shrinkage uniform, it may be worth the effort.

  2. Make tree_t small. Make data a pointer.

    typedef struct tree_t {
        data_T *data;
        struct tree_t *left;
        struct tree_t *right;
    } tree_t;
    

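Idea (1) above might be sketched like this (the group size of 256 is from the answer; reusing the `left` field to link free nodes, and the simplified payload-less struct, are added assumptions of this sketch):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct tree_t {
    struct tree_t *left;
    struct tree_t *right;
} tree_t;

#define GROUP 256

static tree_t *free_list = NULL;  /* reuses `left` as the free-list link */

/* Dole out nodes from 256-node groups; a group is carved up lazily
 * and freed nodes are recycled before a new group is allocated. */
tree_t *tree_alloc(void)
{
    if (free_list == NULL) {
        tree_t *group = malloc(GROUP * sizeof(tree_t));
        if (group == NULL)
            return NULL;
        for (size_t i = 0; i < GROUP; i++) {
            group[i].left = free_list;
            free_list = &group[i];
        }
    }
    tree_t *node = free_list;
    free_list = node->left;
    return node;
}

void tree_free(tree_t *node)
{
    node->left = free_list;
    free_list = node;
}
```

Consecutive allocations come from the same contiguous group, so they tend to share cache lines; the trade-off is that whole groups are never returned to the system.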
Oops! GTG - will make this wiki

answered Sep 30 '22 by chux - Reinstate Monica