I am writing a program to parse a file. It consists of a main loop that reads the file character by character and processes each one. Here is the main loop:
char c;
char *ptr;
for (size_t i = 0; i < size; ++i)
{
    ptr = static_cast<char*>(sentenceMap) + i;
    c = *ptr;
    __builtin_prefetch(ptr + i + 1);
    // some treatment on ptr and c
}
As you can see, I added a __builtin_prefetch instruction, hoping to bring the data for the next iteration of my loop into the cache. I tried different values — ptr + i + 1, ptr + i + 2, ptr + i + 10 — but nothing seems to change.
To measure performance, I use Valgrind's Cachegrind tool, which reports the number of cache misses. On the line c = *ptr, Cachegrind records 632,378 DLmr (last-level cache data read misses) when __builtin_prefetch is not used. What's weird, though, is that this number does not change regardless of the argument I pass to __builtin_prefetch.
Any explanation to that?
That's because the hardware is years ahead of you. :)
There are hardware prefetchers that are designed to recognize simple patterns and do the prefetching for you. In this case, you have a simple sequential access pattern, which is more than trivial for the hardware prefetcher.
Manual prefetching only comes in handy when you have access patterns that the hardware cannot predict.
Here's one such example: Prefetching Examples?
First of all, the smallest unit a cache deals with is called a cache line, and a cache line can be, for example, 64 bytes long, but never as small as 1 byte. So when you prefetch, you need to ask well ahead of your current location of interest. You need to know the cache line size, so you shouldn't ask for addresses located in the same cache line. You also need to avoid calling prefetch too many times, as that might evict cache lines that are about to be used, and the extra instructions themselves cost time.
Modern architectures also have hardware prefetchers, which can recognize your access pattern and fetch data for you in advance. Most of the time this gives access times as good as your simple prefetches would. Nowadays software prefetching only helps if you can find an obvious place to prefetch the data, not by sprinkling prefetches randomly through the code. For example, you might prefetch before starting to process a piece of data, but it won't help if you call prefetch and then access the data immediately. You need to issue the prefetch early enough and do other setup work before touching your data.
I suggest anyone interested in this topic read The Software Optimization Cookbook. I mostly work on the ARM architecture, but I found this book invaluable. Some excerpts related to this question are also available online; see #1 and #2.
The right answer is: prefetching cannot change the number of cache misses; it just forces them to occur earlier. :)