Using gcc 4.4.5 (yeah... I know it's old) on x86_64. Limited to SSE2 (or earlier) instructions for compatibility reasons.
I have what I think should be a textbook case for gaining big benefits from prefetching. I have an array ("A") of 32-bit elements, which are not (and cannot be) in sequential order. These 32-bit elements are indices into a larger data array ("D") of __m128i data. For each element of "A", I need to fetch the __m128i data from the appropriate location in "D", perform an operation on it, and store it back into the same location in "D". Actually, each "entry" in D is SOME_CONST __m128i's big, so if the value in A is 1, the entry in D starts at D[1 * SOME_CONST].
Since consecutive elements in "A" will almost never point to sequential locations in "D", I tend to think that the hardware prefetcher is going to struggle or fail to accomplish anything useful.
However, I can predict which locations I will be accessing next very easily, simply by looking ahead in "A". Enough verbiage... here's some code. The operation I'm performing on the data is to take the lower 64-bits of the __m128i and clone it to the upper 64-bits of the same. First the basic loop, no frills...
// SOME_CONST is either 3 or 4, but this "operation" only needs to happen for 3
for ( i=0; i<arraySize; ++i )
{
register __m128i *dPtr = D + (A[i] * SOME_CONST);
dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );
// The immediate operand selects:
// Bits 0-31 = bits 0-31
// Bits 32-63 = bits 32-63
// Bits 64-95 = bits 0-31
// Bits 96-127 = bits 32-63
// If anyone is more clever than me and knows of a better way to do this in SSE2,
// bonus points. ;-)
}
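In case it helps anyone reading along, one SSE2 alternative for the same low-qword duplication (a sketch, assumed equivalent but not benchmarked here) is punpcklqdq of a register with itself, which copies the low 64 bits into the high 64 bits; either form should compile to a single instruction:
dPtr[0] = _mm_unpacklo_epi64( dPtr[0], dPtr[0] ); // duplicate low 64 bits into high 64 bits
dPtr[1] = _mm_unpacklo_epi64( dPtr[1], dPtr[1] );
dPtr[2] = _mm_unpacklo_epi64( dPtr[2], dPtr[2] );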
I've tried a number of different ways to sprinkle prefetch intrinsics in there, and none have resulted in any kind of speedup yet. For example, I tried unrolling the loop to have a stride of 2 or even 4 elements, but it didn't help...
// Assume the "A" array size is appropriately padded so that overruns don't
// result in SIGSEGV accessing uninitialized memory in D.
register __m128i *dPtr0, *dPtr1, *dPtr2, *dPtr3, *dPtr4, *dPtr5, *dPtr6, *dPtr7;
dPtr4 = D + (A[0] * SOME_CONST);
dPtr5 = D + (A[1] * SOME_CONST);
dPtr6 = D + (A[2] * SOME_CONST);
dPtr7 = D + (A[3] * SOME_CONST);
for ( i=0; i<arraySize; i+=4 )
{
dPtr0 = dPtr4;
dPtr1 = dPtr5;
dPtr2 = dPtr6;
dPtr3 = dPtr7;
dPtr4 = D + (A[i+4] * SOME_CONST);
_mm_prefetch( dPtr4, _MM_HINT_NTA );
_mm_prefetch( dPtr4+1, _MM_HINT_NTA ); // Is it needed? Tried both ways
dPtr5 = D + (A[i+5] * SOME_CONST);
_mm_prefetch( dPtr5, _MM_HINT_NTA );
_mm_prefetch( dPtr5+1, _MM_HINT_NTA ); // Is it needed? Tried both ways
dPtr6 = D + (A[i+6] * SOME_CONST);
_mm_prefetch( dPtr6, _MM_HINT_NTA );
_mm_prefetch( dPtr6+1, _MM_HINT_NTA ); // Is it needed? Tried both ways
dPtr7 = D + (A[i+7] * SOME_CONST);
_mm_prefetch( dPtr7, _MM_HINT_NTA );
_mm_prefetch( dPtr7+1, _MM_HINT_NTA ); // Is it needed? Tried both ways
dPtr0[0] = _mm_shuffle_epi32( dPtr0[0], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr0[1] = _mm_shuffle_epi32( dPtr0[1], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr0[2] = _mm_shuffle_epi32( dPtr0[2], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr1[0] = _mm_shuffle_epi32( dPtr1[0], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr1[1] = _mm_shuffle_epi32( dPtr1[1], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr1[2] = _mm_shuffle_epi32( dPtr1[2], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr2[0] = _mm_shuffle_epi32( dPtr2[0], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr2[1] = _mm_shuffle_epi32( dPtr2[1], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr2[2] = _mm_shuffle_epi32( dPtr2[2], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr3[0] = _mm_shuffle_epi32( dPtr3[0], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr3[1] = _mm_shuffle_epi32( dPtr3[1], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr3[2] = _mm_shuffle_epi32( dPtr3[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}
That is the 4-element version, but I also tried it with only 2 in case that was too much data to be sloshing around. I also tried both _MM_HINT_NTA and _MM_HINT_T0. Somehow there was no appreciable difference.
I also tried a simpler variant, which just attempts to put as much space as seemed reasonable between the prefetch and when it would be used:
#define PREFETCH_DISTANCE 10
// trying 5 overnight, will see results tomorrow...
for ( i=0; i<arraySize; ++i )
{
register __m128i *dPtrFuture, *dPtr;
dPtrFuture = D + (A[i + PREFETCH_DISTANCE] * SOME_CONST);
_mm_prefetch( dPtrFuture, _MM_HINT_NTA ); // tried _MM_HINT_T0 too
_mm_prefetch( dPtrFuture + 1, _MM_HINT_NTA ); // tried _MM_HINT_T0 too
dPtr = D + (A[i] * SOME_CONST);
dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}
Initially I expected this code to stall, but once it gets PREFETCH_DISTANCE elements into the loop, I was hoping it would see a good enough speed boost. Most of these variants cause no more than 0.2 seconds of runtime difference over millions of iterations, which take a total CPU time of 4m:30s on this particular machine (which is stone idle other than me). The differences seem to be indistinguishable from noise in the data.
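(A note on the dPtrFuture + 1 prefetch above and the "Is it needed? Tried both ways" comments in the unrolled version: dPtrFuture + 1 only advances 16 bytes, so with 64-byte cache lines both prefetches usually land in the same line, while a 48- or 64-byte entry can still spill into the next line. A variant that covers both possible lines, assuming 64-byte lines and casting to const char * to match the documented _mm_prefetch prototype, would look like:)
_mm_prefetch( (const char *) dPtrFuture, _MM_HINT_NTA ); // line holding the start of the entry
_mm_prefetch( (const char *) dPtrFuture + 64, _MM_HINT_NTA ); // next line, in case the entry straddles it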
Am I correct in my assumption that prefetching should be able to help me here? If so, what am I doing wrong?
All helpful and interesting thoughts are appreciated.
EDIT:
I created a contrived example that truly randomizes the data in A. I played with buffer sizes from 64MB up to 6400MB. I found that I get an enormous gain from unrolling the loop and pre-calculating the addresses of the next 4 elements while I'm performing my operation on the current 4. But my runtime tanks by a factor of more than 10x if I try to prefetch any of the addresses that I pre-calculated. I'm really scratching my head on this one. My standalone contrived code is:
#include <emmintrin.h> // SSE2: __m128i, _mm_shuffle_epi32
#include <xmmintrin.h> // _mm_prefetch
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#define QUEUE_ELEMENTS 1048576
#define DATA_ELEMENT_SIZE 4 * sizeof( __m128i )
#define DATA_ELEMENTS QUEUE_ELEMENTS
#define QUEUE_ITERATIONS 100000
#define LOOP_UNROLL_4
#define LOOP_UNROLL_2
#ifdef LOOP_UNROLL_4
#define UNROLL_CONST 4
#else
#ifdef LOOP_UNROLL_2
#define UNROLL_CONST 2
#else
#define UNROLL_CONST 1
#endif
#endif
int main( void )
{
unsigned long long randTemp;
unsigned long i, outerLoop;
unsigned long *workQueue;
__m128i *data, *dataOrig;
clock_t timeStamp;
workQueue = malloc( QUEUE_ELEMENTS * sizeof( unsigned long ) );
dataOrig = malloc( (DATA_ELEMENTS * DATA_ELEMENT_SIZE) + 2 );
if ( (unsigned long long) dataOrig & 0xf )
{
data = (__m128i *) (((unsigned long long) dataOrig & ~0xf) + 0x10);
// force 16-byte (128-bit) alignment
} else data = dataOrig;
// Not initializing data, because its contents aren't important.
for ( i=0; i<QUEUE_ELEMENTS; ++i )
{
randTemp = (unsigned long long)rand() *
(unsigned long long) QUEUE_ELEMENTS / (unsigned long long) RAND_MAX;
workQueue[i] = (unsigned long) randTemp;
}
printf( "Starting work...\n" );
// Actual work happening below... start counting.
timeStamp = clock();
for ( outerLoop = 0; outerLoop < QUEUE_ITERATIONS; ++outerLoop )
{
register __m128i *dataPtr0, *dataPtr1, *dataPtr2, *dataPtr3;
register __m128i *dataPtr4, *dataPtr5, *dataPtr6, *dataPtr7;
#ifdef LOOP_UNROLL_2
dataPtr4 = data + (workQueue[0] * DATA_ELEMENT_SIZE);
dataPtr5 = data + (workQueue[1] * DATA_ELEMENT_SIZE);
#endif
#ifdef LOOP_UNROLL_4
dataPtr6 = data + (workQueue[2] * DATA_ELEMENT_SIZE);
dataPtr7 = data + (workQueue[3] * DATA_ELEMENT_SIZE);
#endif
for ( i=0; i<QUEUE_ELEMENTS; i+=UNROLL_CONST )
{
#ifdef LOOP_UNROLL_2
dataPtr0 = dataPtr4;
dataPtr4 = data + (workQueue[i+4] * DATA_ELEMENT_SIZE);
// _mm_prefetch( dataPtr4, _MM_HINT_T0 );
dataPtr1 = dataPtr5;
dataPtr5 = data + (workQueue[i+5] * DATA_ELEMENT_SIZE);
// _mm_prefetch( dataPtr5, _MM_HINT_T0 );
#endif
#ifdef LOOP_UNROLL_4
dataPtr2 = dataPtr6;
dataPtr6 = data + (workQueue[i+6] * DATA_ELEMENT_SIZE);
// _mm_prefetch( dataPtr6, _MM_HINT_T0 );
dataPtr3 = dataPtr7;
dataPtr7 = data + (workQueue[i+7] * DATA_ELEMENT_SIZE);
// _mm_prefetch( dataPtr7, _MM_HINT_T0 );
#endif
#if !defined( LOOP_UNROLL_2 ) && !defined( LOOP_UNROLL_4 )
dataPtr0 = data + (workQueue[i] * DATA_ELEMENT_SIZE);
#endif
_mm_shuffle_epi32( dataPtr0[0], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr0[1], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr0[2], 0 | (1<<2) | (0<<4) | (1<<6) );
// Per original code, no need to perform operation on dataPtrx[3]
#ifdef LOOP_UNROLL_2
_mm_shuffle_epi32( dataPtr1[0], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr1[1], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr1[2], 0 | (1<<2) | (0<<4) | (1<<6) );
#endif
#ifdef LOOP_UNROLL_4
_mm_shuffle_epi32( dataPtr2[0], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr2[1], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr2[2], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr3[0], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr3[1], 0 | (1<<2) | (0<<4) | (1<<6) );
_mm_shuffle_epi32( dataPtr3[2], 0 | (1<<2) | (0<<4) | (1<<6) );
#endif
}
if ( (outerLoop % 1000) == 0 ) { putchar( '.' ); fflush( stdout ); }
}
timeStamp = clock() - timeStamp;
printf( "\nRun was %lu seconds.\n", timeStamp / CLOCKS_PER_SEC );
free( dataOrig );
free( workQueue );
return 0;
}
I even wrote an 8x unrolled loop, and it still scaled perfectly to 10 seconds of runtime. I'm amazed it didn't saturate there, because I would certainly run out of registers at that point, holding 16 unique pointers. So what can I learn from this? That my inner-loop code is so tight that it is greatly overshadowed by the overhead of the loop construct itself? Are there any bugs in this code that I'm not seeing? All builds were with gcc -O2.
High performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others.
You want to prefetch once per 64-byte cache line, and you'll need to tune how far ahead to prefetch: far enough ahead of where you're loading to hide some of the latency of a cache miss, but close enough that the prefetched lines still easily fit in L1D.
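Applied to the loops in the question, a minimal sketch of that advice might look like the following. It is essentially the question's PREFETCH_DISTANCE variant, but it issues one prefetch per 64-byte line of the upcoming entry rather than per __m128i; PF_DIST is a tuning knob and the value shown is only illustrative, and A is assumed to be padded past its end as in the question.
#define PF_DIST 8 // how many elements of A to run ahead; tune this
for ( i=0; i<arraySize; ++i )
{
  register __m128i *dPtr, *next;
  next = D + (A[i + PF_DIST] * SOME_CONST); // A assumed padded, as in the question
  _mm_prefetch( (const char *) next, _MM_HINT_T0 ); // first 64-byte line of the entry
  _mm_prefetch( (const char *) next + 64, _MM_HINT_T0 ); // second line, only matters if the entry straddles it
  dPtr = D + (A[i] * SOME_CONST);
  dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}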
Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').
If your data resides in main memory, don't expect much of a speedup: as the arithmetic below shows, there is very little memory-transfer time here for prefetching to recover.
With a 150 ns memory cycle time, 64-byte cache lines and a 4 GB/sec streaming transfer rate (my AMD system; Intel is faster), and using 48 bytes (3 x 128 bits) of each 64-byte cache line read, the system is fetching about 320 MB of usable data per second. Prefetching brings the rate closer to the peak of 4000 MB/s. The total savings available to prefetching is therefore 0.92 seconds for every 320 MB read. At 320 MB/s, 270 seconds (4m 30s) is roughly 86 GB worth of memory transfer time; the program is probably not spending more than a tiny fraction of this (< 1%, under 1 GB) on actually reading memory. Completely eliminating memory i/o would save... 1% of the runtime.
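Spelling that arithmetic out, using the figures assumed above (estimates, not measurements from the machine in question):
64 bytes per demand miss / 150 ns        ~ 427 MB/s of cache lines fetched
427 MB/s * (48 usable bytes / 64)        ~ 320 MB/s of usable data
320 MB at 320 MB/s (all demand misses)   = 1.00 s
320 MB at 4000 MB/s (perfect streaming)  = 0.08 s
maximum saving from prefetching          = 0.92 s per 320 MB actually read
320 MB/s * 270 s                         ~ 86 GB transferred if the whole run were memory-bound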
The drawback to too much prefetching is that prefetched data displaces useful things from the really fast but really small memory close to the cpu (level 1 and level 2 caches, kilobytes not megabytes). This may account for the pessimal performance of some of the test runs.
The fact that unrolling the loop sped up the program while prefetching didn't also suggests that the processing itself, not the data movement, is the primary bottleneck.