 

Fastest way fetch an array of memory values

At the heart of an indexing structure, I find myself wondering whether an optimization could be made for the following problem:

I have a big (several GB) in-RAM array of small structures, and a smaller array of indices (on the order of 10e4 elements). The indices are almost randomly spread. I have an aggregating function that is order-independent ("associative", for the mathematicians); let's say, for example, "sum".

I want to aggregate the values retrieved from the big array at the indices specified in the small array.

Currently I spend most of the time fetching from memory: since the indices are randomly spread and the table is big, there are lots of cache misses, but since I know the indices in advance, some prefetching should be possible. I find it hard to profile whether any prefetching optimizations are actually taking place, or how much speedup I could expect from such an optimization.

So my question is: what's the fastest way to fetch from known memory locations? Is there some dark-art programming magic to do so? Are there architecture- or platform-specific ways to do it? I'm looking for C++ or C# solutions.

darkblue asked Nov 18 '13 13:11



1 Answer

Without knowing anything else about your problem or your current implementation, one relatively easy way to improve performance (to some extent) is to manually prefetch the values that your "sum" function is going to operate on.

Ignoring architectural and compiler nuances for now, manual prefetching could look like this:

SmallStruct values [value_count] = {/*whatever*/};
int indices [index_count] = {/*whatever*/};
...

SmallStruct v = values[indices[0]];   // assumes index_count >= 1
for (int i = 1; i < index_count; ++i)
{
    SmallStruct v_next = values[indices[i]];
    DoSomethingWith (v); // Note the *v*
    v = v_next; // You don't want to copy, but this is the simplest form
}
DoSomethingWith (v); // Do the final item

The above is the simplest possible form of prefetching. You can unroll the loop a little bit to avoid the copying mentioned above, and also you probably want to do more than a single prefetch.

This optimization works because most (all?) modern architectures can have more than one memory request in flight. Those requests overlap, so the average waiting time for the (presumably uncached) reads is divided by their concurrency, which is a good thing! So it doesn't matter how many unused cache lines you pull in; the important factor is the number of concurrent memory reads the memory system can sustain at any given time.
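To make the overlap concrete, here is a minimal sketch (the container types, the 4-way grouping, and the function name are my own choices, not from the question) that sums indexed values through independent accumulators, so several cache misses can be in flight at once instead of serializing:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: process indices in groups of 4 so the four
// (likely uncached) loads are independent of each other. The CPU's
// out-of-order engine can then overlap the cache misses.
uint64_t sum_indexed(const std::vector<uint32_t>& values,
                     const std::vector<std::size_t>& indices)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= indices.size(); i += 4)
    {
        // No accumulator depends on another, so these loads overlap.
        s0 += values[indices[i + 0]];
        s1 += values[indices[i + 1]];
        s2 += values[indices[i + 2]];
        s3 += values[indices[i + 3]];
    }
    for (; i < indices.size(); ++i)  // leftover tail
        s0 += values[indices[i]];
    return s0 + s1 + s2 + s3;
}
```

Whether this helps in practice depends on how far the out-of-order window reaches; measuring on your own data is the only reliable check.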

A Note on the Effect of Cache Lines

The above (admittedly simplistic) code ignores two very important facts: that the entire SmallStruct cannot be read in one memory access (from the CPU's point of view), which is a bad thing; and that memory is always read in units of cache lines (64 or 128 bytes, these days) anyway, which is very good!

So, instead of trying to read the entire values[indices[i]] into v_next, we can read just a single byte; assuming the values array is properly aligned, a significant amount of memory (one full cache line) will be loaded and at hand for eventual processing.

Two important points:

  1. If your SmallStruct is not in fact small and won't fit entirely in a cache line, you must rearrange its members to make sure that the parts required in DoSomethingWith() are contiguous, packed, and fit in one cache line. If they still don't fit, you should consider separating your algorithm into two or more passes, each operating on data that fits in one cache line.
  2. If you just read one byte (or one word, or whatever) from the next value you'll be accessing, make sure that the compiler doesn't optimize that read away!
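For point 1, a hot/cold split is one way to do the rearranging. The sketch below is hypothetical (the field names and sizes are made up for illustration, not taken from the question): the fields the aggregation actually reads go into one small struct, everything else into a parallel array, so every cache line the aggregation pass pulls in is full of useful data.

```cpp
// "Hot" part: only the fields DoSomethingWith() reads; packed so it
// fits comfortably within one 64-byte cache line.
struct SmallStructHot
{
    double weight;
    int    value;
};

// "Cold" part: everything else, touched only in other passes.
struct SmallStructCold
{
    char metadata[56];
};

// Instead of one array of big structs, keep two parallel arrays:
//   SmallStructHot  hot[value_count];
//   SmallStructCold cold[value_count];
// The aggregation pass then walks only `hot`.
```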

Alternate Implementations

The second point above can be expressed in code, like this:

touch (&values[indices[0]]);
for (int i = 0; i < index_count; ++i)
{
    if (i + 1 < index_count)
        touch (&values[indices[i + 1]]);

    DoSomethingWith (values[indices[i]]);
}

The touch() function is semantically like this (although a real implementation would probably be more involved):

void touch (const void * p)
{
    // The volatile read keeps the compiler from optimizing the access away.
    char c = *(volatile const char *)p;
    (void)c;
}

To prefetch more than one value, you'd do something like this (Update: I changed my code to what I believe is a better implementation):

const int PrefetchCount = 3;

// Get the ball rolling...
for (int j = 0; j < PrefetchCount && j < index_count; ++j)
    touch (&values[indices[j]]);

for (int i = 0; i < index_count; ++i)
{
    if (i + PrefetchCount < index_count)
        touch (&values[indices[i + PrefetchCount]]);

    DoSomethingWith (values[indices[i]]);
}

Again, note that all the implementations above are very simple and simplistic. Also, if you prefetch too much, you can blow your L1 cache and your performance with it.

Doing the Actual Prefetch

x86-64 CPUs have an instruction you can use to ask the CPU to prefetch a cache line's worth of data into its caches. Using this instruction, you give the CPU a hint that a particular memory location is going to be used by your application, and the CPU will try to bring it into cache. If you do this soon enough, the data will be ready by the time you need it and your computations won't stall.

The instruction family is PREFETCH*, and you can use compiler-specific intrinsics instead of resorting to assembly. The intrinsic is called _mm_prefetch on the Microsoft and Intel C++ compilers, and __builtin_prefetch on GCC. (If you end up using this, just remember that you want the lowest level of prefetching, i.e. T0.)

Note that these go into the implementation of the touch function I used above.
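As a sketch of what that could look like (the exact branch structure and the portable fallback are my own choices), here is one way to implement touch() on top of those intrinsics:

```cpp
#if defined(_MSC_VER)
#include <xmmintrin.h>   // _mm_prefetch and _MM_HINT_T0
#endif

// touch() built on the intrinsics mentioned above. The #else branch is
// a portable fallback: a volatile read the compiler cannot drop.
inline void touch (const void * p)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch (p, 0 /* read */, 3 /* high locality, ~T0 */);
#elif defined(_MSC_VER)
    _mm_prefetch ((const char *) p, _MM_HINT_T0);
#else
    char c = *(volatile const char *) p;
    (void) c;
#endif
}
```

Note that a prefetch is only a hint: it never faults and never changes program results, so you can sprinkle these in and measure freely.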

I know of no library that does this in a reusable way. Also, I have no familiarity with C# libraries to know whether these are available there or not.

yzt answered Nov 08 '22 01:11
