How do I gain measurable benefit from prefetch intrinsics?

Tags:

Using gcc 4.4.5 (yeah... I know it's old) on x86_64. Limited to SSE2 (or earlier) instructions for compatibility reasons.

I have what I think should be a textbook case for gaining big benefits from prefetching. I have an array ("A") of 32-bit elements, which are not (and cannot be) in sequential order. These 32-bit elements are indices into a larger data array ("D") of __m128i data. For each element of "A", I need to fetch the __m128i data from the appropriate location in "D", perform an operation on it, and store it back into the same location in "D". Actually each "entry" in D is "SOME_CONST" __m128i's big. So if the value in A is "1", the index into D is D[1 * SOME_CONST].

Since consecutive elements in "A" will almost never point to sequential locations in "D", I tend to think that the hardware prefetcher is going to struggle or fail to accomplish anything useful.

However, I can predict which locations I will be accessing next very easily, simply by looking ahead in "A". Enough verbiage... here's some code. The operation I'm performing on the data is to take the lower 64-bits of the __m128i and clone it to the upper 64-bits of the same. First the basic loop, no frills...

// SOME_CONST is either 3 or 4, but this "operation" only needs to happen for 3

for ( i=0; i<arraySize; ++i )
{
  register __m128i *dPtr = D + (A[i] * SOME_CONST);
  dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );

  // The immediate operand selects:
  // Bits 0-31   = bits 0-31
  // Bits 32-63  = bits 32-63
  // Bits 64-95  = bits 0-31
  // Bits 96-127 = bits 32-63

  // If anyone is more clever than me and knows of a better way to do this in SSE2,
  //  bonus points.  ;-)
}

I've tried a number of different ways to sprinkle prefetch instrinsics in there, and none have resulted in any kind of speedup yet. For example, I tried unrolling the loop to have a stride of 2 or even 4 elements, but it didn't help...

// Assume the "A" array size is appropriately padded so that overruns don't
//  result in SIGSEGV accessing uninitialized memory in D.

register __m128i *dPtr0, *dPtr1, *dPtr2, *dPtr3, *dPtr4, *dPtr5, *dPtr6, *dPtr7;
dPtr4 = D + (A[0] * SOME_CONST);
dPtr5 = D + (A[1] * SOME_CONST);
dPtr6 = D + (A[2] * SOME_CONST);
dPtr7 = D + (A[3] * SOME_CONST);

for ( i=0; i<arraySize; i+=4 )
{
  dPtr0 = dPtr4;
  dPtr1 = dPtr5;
  dPtr2 = dPtr6;
  dPtr3 = dPtr7;

  dPtr4 = D + (A[i+4] * SOME_CONST);
  _mm_prefetch( dPtr4, _MM_HINT_NTA );
  _mm_prefetch( dPtr4+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr5 = D + (A[i+5] * SOME_CONST);
  _mm_prefetch( dPtr5, _MM_HINT_NTA );
  _mm_prefetch( dPtr5+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr6 = D + (A[i+6] * SOME_CONST);
  _mm_prefetch( dPtr6, _MM_HINT_NTA );
  _mm_prefetch( dPtr6+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr7 = D + (A[i+7] * SOME_CONST);
  _mm_prefetch( dPtr7, _MM_HINT_NTA );
  _mm_prefetch( dPtr7+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr0[0] = _mm_shuffle_epi32( dPtr0[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr0[1] = _mm_shuffle_epi32( dPtr0[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr0[2] = _mm_shuffle_epi32( dPtr0[2], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr1[0] = _mm_shuffle_epi32( dPtr1[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr1[1] = _mm_shuffle_epi32( dPtr1[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr1[2] = _mm_shuffle_epi32( dPtr1[2], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr2[0] = _mm_shuffle_epi32( dPtr2[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr2[1] = _mm_shuffle_epi32( dPtr2[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr2[2] = _mm_shuffle_epi32( dPtr2[2], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr3[0] = _mm_shuffle_epi32( dPtr3[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr3[1] = _mm_shuffle_epi32( dPtr3[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr3[2] = _mm_shuffle_epi32( dPtr3[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}

That is the 4 element version, but I also tried with only 2 in case that was too much data to be sloshing around. Also I tried using _MM_HINT_NTA and _MM_HINT_T0. No appreciable difference somehow.

I also tried a simpler variant, which just attempts to put as much space as seemed reasonable between the prefetch and when it would be used:

#define PREFETCH_DISTANCE 10
// trying 5 overnight, will see results tomorrow...

for ( i=0; i<arraySize; ++i )
{
  register __m128i *dPtrFuture, *dPtr;
  dPtrFuture = D + (A[i + PREFETCH_DISTANCE] * SOME_CONST);
  _mm_prefetch( dPtrFuture, _MM_HINT_NTA );      // tried _MM_HINT_T0 too
  _mm_prefetch( dPtrFuture + 1, _MM_HINT_NTA );  // tried _MM_HINT_T0 too

  dPtr = D + (A[i] * SOME_CONST);
  dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}

Initially I expect this code to stall, but once it gets "PREFETCH_DISTANCE" in to the loop, I was hoping it would see a good enough speed boost. Most of these variants cause no more than .2 seconds of runtime difference over millions of iterations which take a total CPU time of 4m:30s on this particular machine (which is stone idle other than me). The differences seem to be indistinguishable from "noise" in the data.

Am I correct in my assumption that prefetching should be able to help me here? If so, what am I doing wrong?

All helpful and interesting thoughts are appreciated.

EDIT:

I created a contrived example that truly randomizes the data in A. I played with the buffer sizes from 64MB up to 6400MB. I found that I get enormous gain out of unrolling the loop and pre-calculating the addresses of the next 4 elements as I'm performing my operation on the current 4. But my runtime tanks by a factor of >10x if I try to prefetch any of the addresses that I pre-calculated. I'm really scratching my head on this one. My standalone contrived code is:

#include <xmmintrin.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

#define QUEUE_ELEMENTS    1048576
#define DATA_ELEMENT_SIZE 4 * sizeof( __m128i )
#define DATA_ELEMENTS     QUEUE_ELEMENTS

#define QUEUE_ITERATIONS  100000
#define LOOP_UNROLL_4
#define LOOP_UNROLL_2

#ifdef LOOP_UNROLL_4
  #define UNROLL_CONST 4
#else
  #ifdef LOOP_UNROLL_2
    #define UNROLL_CONST 2
  #else
    #define UNROLL_CONST 1
  #endif
#endif

int main( void )
{
  unsigned long long randTemp;
  unsigned long i, outerLoop;
  unsigned long *workQueue;
  __m128i *data, *dataOrig;
  clock_t timeStamp;

  workQueue = malloc( QUEUE_ELEMENTS * sizeof( unsigned long ) );

  dataOrig = malloc( (DATA_ELEMENTS * DATA_ELEMENT_SIZE) + 2 );
  if ( (unsigned long long) dataOrig & 0xf )
  {
    data = (__m128i *) (((unsigned long long) dataOrig & ~0xf) + 0x10);
    // force 16-byte (128-bit) alignment
  } else data = dataOrig;

  // Not initializing data, because its contents isn't important.

  for ( i=0; i<QUEUE_ELEMENTS; ++i )
  {
    randTemp = (unsigned long long)rand() *
     (unsigned long long) QUEUE_ELEMENTS / (unsigned long long) RAND_MAX;
    workQueue[i] = (unsigned long) randTemp;
  }

  printf( "Starting work...\n" );
  // Actual work happening below... start counting.
  timeStamp = clock();

  for ( outerLoop = 0; outerLoop < QUEUE_ITERATIONS; ++outerLoop )
  {
    register __m128i *dataPtr0, *dataPtr1, *dataPtr2, *dataPtr3;
    register __m128i *dataPtr4, *dataPtr5, *dataPtr6, *dataPtr7;

    #ifdef LOOP_UNROLL_2
      dataPtr4 = data + (workQueue[0] * DATA_ELEMENT_SIZE);
      dataPtr5 = data + (workQueue[1] * DATA_ELEMENT_SIZE);
    #endif
    #ifdef LOOP_UNROLL_4
      dataPtr6 = data + (workQueue[2] * DATA_ELEMENT_SIZE);
      dataPtr7 = data + (workQueue[3] * DATA_ELEMENT_SIZE);
    #endif

    for ( i=0; i<QUEUE_ELEMENTS; i+=UNROLL_CONST )
    {
      #ifdef LOOP_UNROLL_2
        dataPtr0 = dataPtr4;
        dataPtr4 = data + (workQueue[i+4] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr4, _MM_HINT_T0 );
        dataPtr1 = dataPtr5;
        dataPtr5 = data + (workQueue[i+5] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr5, _MM_HINT_T0 );
      #endif
      #ifdef LOOP_UNROLL_4
        dataPtr2 = dataPtr6;
        dataPtr6 = data + (workQueue[i+6] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr6, _MM_HINT_T0 );
        dataPtr3 = dataPtr7;
        dataPtr7 = data + (workQueue[i+7] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr7, _MM_HINT_T0 );
      #endif
      #if !defined( LOOP_UNROLL_2 ) && !defined( LOOP_UNROLL_4 )
        dataPtr0 = data + (workQueue[i] * DATA_ELEMENT_SIZE);
      #endif

      _mm_shuffle_epi32( dataPtr0[0], 0 | (1<<2) | (0<<4) | (1<<6) );
      _mm_shuffle_epi32( dataPtr0[1], 0 | (1<<2) | (0<<4) | (1<<6) );
      _mm_shuffle_epi32( dataPtr0[2], 0 | (1<<2) | (0<<4) | (1<<6) );
      // Per original code, no need to perform operation on dataPtrx[3]

      #ifdef LOOP_UNROLL_2
        _mm_shuffle_epi32( dataPtr1[0], 0 | (1<<2) | (0<<4) | (1<<6) );
        _mm_shuffle_epi32( dataPtr1[1], 0 | (1<<2) | (0<<4) | (1<<6) );
        _mm_shuffle_epi32( dataPtr1[2], 0 | (1<<2) | (0<<4) | (1<<6) );
      #endif
      #ifdef LOOP_UNROLL_4
        _mm_shuffle_epi32( dataPtr2[0], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr2[1], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr2[2], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr3[0], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr3[1], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr3[2], 0 | (1<<2) | (0<<4) | (1<<6) );  
      #endif
    }
    if ( (outerLoop % 1000) == 0 ) { putchar( '.' ); fflush( stdout ); }
  }

  timeStamp = clock() - timeStamp;
  printf( "\nRun was %lu seconds.\n", timeStamp / CLOCKS_PER_SEC );

  free( dataOrig );
  free( workQueue );

  return 0;
}

My runtime with LOOP_UNROLL_2 is 40 seconds.
My runtime with LOOP_UNROLL_4 is 20 seconds.
My runtime without any unrolling is 80 seconds.
When I enable the prefetches, the runtime is so long that I just kill the process rather than wait for it.

I even wrote an 8x unrolled loop, and it still scaled perfectly to 10 seconds of runtime. I'm amazed it didn't saturate there because I would certainly run out of registers at that point, holding 16 unique pointers. So what can I learn from this? That my inner loop code is so tight that it is greatly overshadowed by the overhead in the loop construct itself? Are there any bugs in this code that I'm not seeing? All builds were with gcc -O2.

608

asked Sep 20 '14 08:09

Marty

1 Answers

If your data resides in memory, don't expect much of a speedup; prefetching from memory has very little available to improve.

With 150 ns cycle time time, 64 byte cache lines and 4GB/sec streaming transfer rate (my AMD system; Intel is faster), and using 48 bytes (3 x 128 bits) of each 64-byte cache line read, the system is fetching 320 MB of usable data per second. Prefetching brings the rate closer to the peak of 4000 MB/s. The total savings available to prefetching is .92 seconds for every 320 MB read. At 320 MB/s, 270 seconds (4m 30s) is 840 GB worth of memory transfer time; the progran is probably not spending more than a tiny fraction of this (<1%, 8GB) on actually reading memory. Completely eliminating memory i/o would save... 1% of the runtime.

The drawback to too much prefetching is that prefetched data displaces useful things from the really fast but really small memory close to the cpu (level 1 and level 2 caches, kilobytes not megabytes). This may account for the pessimal performance of some of the test runs.

That unrolling the loop sped up the program but prefetching didn't also suggests that the processing itself is the primary bottleneck, not the data movement.

114

answered Sep 28 '22 03:09

Andras

Related questions
                            
                                getattr() versus dict lookup, which is faster?
                            
                                Update the average of a continuous sequence of numbers in constant time
                            
                                Fastest way to write cells to Excel with Office Interop?
                            
                                MySQL indexes - how many are enough?
                            
                                Python @property versus method performance - which one to use?
                            
                                What are the benefits of using semantic HTML?
                            
                                PHP try-catch block inside loop [duplicate]
                            
                                Measuring exception handling overhead in C++
                            
                                Why is IronPython faster than the Official Python Interpreter
                            
                                Alternative to if, else if
                            
                                C# - For vs Foreach - Huge performance difference
                            
                                Overhead of implementing an interface
                            
                                Will multi threading provide any performance boost?
                            
                                Speeding Up Java
                            
                                App crashes when running maps activity
                            
                                What is the difference between latency and response time?
                            
                                What is more efficient, i++ or ++i? [duplicate]
                            
                                MongoDB performance of db.eval() in javascript shell vs. pymongo
                            
                                jQuery datatables optimization in IE, 96.7% of time spent in offsetWidth

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How do I gain measurable benefit from prefetch intrinsics?

Tags:

performance

x86-64

simd

sse

prefetch

Marty

People also ask

1 Answers

Andras

Recent Activity

Donate For Us