I have loop like this
start = __rdtsc();
unsigned long long count = 0;
for(int i = 0; i < N; i++)
for(int j = 0; j < M; j++)
count += tab[i][j];
stop = __rdtsc();
time = (stop - start) * 1/3;
Need to check how prefetch data influences on efficiency. How to force prefetch some values from memory into cache before they will be counted?
For GCC only:
__builtin_prefetch((const void*)(prefetch_address),0,0);
prefetch_address
can be invalid, there will be no segfault. If there too small difference between prefetch_address
and current location, there might be no effect or even slowdown. Try to set it at least 1k ahead.
First, I suppose that tab
is a large 2D array such as a static array (e.g., int tab[1024*1024][1024*1024]
) or a dynamically-allocated array (e.g., int** tab
and following malloc
s). Here, you want to prefetch some data from tab
to the cache to reduce the execution time.
Simply, I don't think that you need to manually insert any prefetching to your code, where a simple reduction for a 2D array is performed. Modern CPUs will do automatic prefetching if necessary and profitable.
Two facts you should know for this problem:
(1) You are already exploit the spatial locality of tab
inside of the innermost loop. Once tab[i][0]
is read (after a cache miss, or a page fault), the data from tab[i][0]
to tab[i][15]
will be in your CPU caches, assuming that the cache line size is 64 bytes.
(2) However, when the code traverses in the row, i.e., tab[i][M-1]
to tab[i+1][0]
, it is highly likely to happen a cold cache miss, especially when tab
is a dynamically-allocated array where each row could be allocated in a fragmented way. However, if the array is statically allocated, each row will be located contiguously in the memory.
So, prefetching makes a sense only when you read (1) the first item of the next row and (2) j + CACHE_LINE_SIZE/sizeof(tab[0][0])
ahead of time.
You may do so by inserting a prefetch operation (e.g., __builtin_prefetch
) in the upper loop. However, modern compilers may not always emit such prefetch instructions. If you really want to do that, you should check the generated binary code.
However, as I said, I do not recommend you do that because modern CPUs will mostly do prefetching automatically, and that automatic prefetching will mostly outperform your manual code. For instance, an Intel CPU like Ivy Bridge processors, there are multiple data prefetchers such as prefetching to L1, L2, or L3 cache. (I don't think mobile processors have a fancy data prefetcher though). Some prefetchers will load adjacent cache lines.
If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are more friendly to caches. A notable example would be blocked(titled) matrix multiply. A naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm significantly reduces cache misses by calculating on small subsets that are fit to caches. See some references like this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With